dth981485742 2014-10-17 09:00
浏览 61

如何在MySQL列中找到最常提到的主题标签?

My database has a table messages with 3 fields: 'messageid',messagetext (varchar), dateposted (datetime)

I want to store a bunch of messages in the field messagetext along with their respective date of posting in the field dateposted. A lot of these messages will have hashtags in them.

Then, using PHP and MySQL I want to find out which hashtags are the top 5 most frequently mentioned hashtags in messages posted in the past week.

How can I do this? I'd really appreciate any help. Many thanks in advance.

  • 写回答

1条回答 默认 最新

  • duanping6698 2014-10-17 09:48
    关注

    Do not take me wrongly, but you've set up yourself for a world of hurt. The best way to proceed would be to follow lonesomeday's advice and parse hashtags at insert time. This also greatly reduces processing time as well as making it more deterministic (the workload is "spread" between inserts).

    If you want to proceed anyway, you need to tackle several problems.

    1) Recognize the tags.

    2) Multiple-select the tags. If you have a message saying that "#MySQL splitting is #cool", you want to get two rows from that one message, one saying 'MySQL', the other 'cool'.

    3) Selecting the appropriate messages

    4) Performances

    You can approach this in at least two ways. You can use a stored function , which you find here on SO (actually from this site) - you will have to modify it though.

    This syntax will get you the first occurrence of #hashtag in value plus all the text following it:

    select substring(value, LENGTH(substring_index(value, '#', 1))+1);
    

    You will then need to decide where, for each #hashtag, it #stops. (And it could be #parenthesized). At this point you need a regexp, or to search for a sequence of at least one alphanumeric character - in regexp parlance, [a-zA-Z0-9]+ - by specifying all possible characters or by using a loop, i.e. "#" is OK, "#t" is OK, "#ta" is OK, "#tag" is OK, "#tag," is not and so your hashtag is '#tag' (or 'tag').

    Another more promising approach is to use a user defined function to capture the hashtags; you can use PREG_CAPTURE.

    You will probably have to merge both approaches: modifying the stored function's setup and inner loop to read

    DECLARE cur1 CURSOR FOR SELECT messages.messagetext
                                     FROM messages
                                     WHERE messages.messagetext LIKE '%#%';
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
    
    DROP TEMPORARY TABLE IF EXISTS table2;
    CREATE TEMPORARY TABLE table2( hashtag` VARCHAR(255) NOT NULL
    

    ) ENGINE=Memory;

    ...
    
    SET occurrence = (SELECT LENGTH(msgtext)
                         - LENGTH(REPLACE(msgtext, '#, ''))
                         +1);
    SET i=1;
    WHILE i <= occurrence DO
      INSERT INTO table2 VALUES SELECT PREG_CAPTURE('/#([a-z0-9]+)/i', messagetext, occurrence));
      SET i = i + 1;
    
    END WHILE;
    ...
    

    This will return a list of message-ids and hashtags. You then need to GROUP them BY hashtag, count them and ORDER by count DESC, finally adding LIMIT 5 to only get the five most popular.

    评论

报告相同问题?

悬赏问题

  • ¥15 一直显示正在等待HID—ISP
  • ¥15 Python turtle 画图
  • ¥15 关于大棚监测的pcb板设计
  • ¥15 stm32开发clion时遇到的编译问题
  • ¥15 lna设计 源简并电感型共源放大器
  • ¥15 如何用Labview在myRIO上做LCD显示?(语言-开发语言)
  • ¥15 Vue3地图和异步函数使用
  • ¥15 C++ yoloV5改写遇到的问题
  • ¥20 win11修改中文用户名路径
  • ¥15 win2012磁盘空间不足,c盘正常,d盘无法写入