Do not take me wrongly, but you've set up yourself for a world of hurt. The best way to proceed would be to follow lonesomeday's advice and parse hashtags at insert time. This also greatly reduces processing time as well as making it more deterministic (the workload is "spread" between inserts).
If you want to proceed anyway, you need to tackle several problems.
1) Recognize the tags.
2) Multiple-select the tags. If you have a message saying that "#MySQL splitting is #cool", you want to get two rows from that one message, one saying 'MySQL', the other 'cool'.
3) Selecting the appropriate messages
4) Performances
You can approach this in at least two ways. You can use a stored function , which you find here on SO (actually from this site) - you will have to modify it though.
This syntax will get you the first occurrence of #hashtag in value
plus all the text following it:
select substring(value, LENGTH(substring_index(value, '#', 1))+1);
You will then need to decide where, for each #hashtag, it #stops. (And it could be #parenthesized). At this point you need a regexp, or to search for a sequence of at least one alphanumeric character - in regexp parlance, [a-zA-Z0-9]+
- by specifying all possible characters or by using a loop, i.e. "#" is OK, "#t" is OK, "#ta" is OK, "#tag" is OK, "#tag," is not and so your hashtag is '#tag' (or 'tag').
Another more promising approach is to use a user defined function to capture the hashtags; you can use PREG_CAPTURE
.
You will probably have to merge both approaches: modifying the stored function's setup and inner loop to read
DECLARE cur1 CURSOR FOR SELECT messages.messagetext
FROM messages
WHERE messages.messagetext LIKE '%#%';
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
DROP TEMPORARY TABLE IF EXISTS table2;
CREATE TEMPORARY TABLE table2( hashtag` VARCHAR(255) NOT NULL
) ENGINE=Memory;
...
SET occurrence = (SELECT LENGTH(msgtext)
- LENGTH(REPLACE(msgtext, '#, ''))
+1);
SET i=1;
WHILE i <= occurrence DO
INSERT INTO table2 VALUES SELECT PREG_CAPTURE('/#([a-z0-9]+)/i', messagetext, occurrence));
SET i = i + 1;
END WHILE;
...
This will return a list of message-ids and hashtags. You then need to GROUP
them BY
hashtag, count them and ORDER
by count DESC
, finally adding LIMIT 5
to only get the five most popular.