I am dealing with a performance problem that I cannot find a way to fix. I want to search a text for usernames mentioned with an @ sign in front. The list of usernames is available as a PHP array.
The problem is that usernames may contain spaces or other special characters; there is no restriction on them. So I can't come up with a regex that handles this.
Currently I am using a function that takes the whole line after the @ and checks, character by character, which usernames could still match the mention, until only one username is left that fully matches it. But for a long text with 5 mentions it takes several seconds (!!!) to finish. With more than 20 mentions the script runs endlessly.
I have some ideas, but I don't know if they would work.
- Going through the username list (could be 1,000 names or more) and searching for every @Username without regex, just a plain string search. I would guess this is far more inefficient.
- Checking with JavaScript, while the user is writing, whether the username contains a space or punctuation mark, and then surrounding it with quotation marks, like @"User Name". I don't like that idea; it looks dirty to the user.
- Not starting with one character but with maybe 4, and going back if there is no match. The same principle as in sorting algorithms: divide and conquer. This could be difficult to implement and may lead to nothing.
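To make the first idea concrete, here is a minimal sketch that can be timed against the current function. It jumps from one @ to the next with `strpos` and compares each known name against the text at that position, longest name first. The function name `find_mentions_simple` is hypothetical, and the sketch assumes exact, case-sensitive byte comparison, so it skips the `utf8_clean_string` normalization used in my real code.

```php
<?php
// Hypothetical helper, not part of the original code.
// Assumption: names are compared byte for byte (no utf8_clean_string normalization).
function find_mentions_simple(string $text, array $usernames): array
{
    // Longest names first, so a longer name wins over a shorter one that is its prefix.
    usort($usernames, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    $found = [];
    $offset = 0;
    while (($at = strpos($text, '@', $offset)) !== false) {
        $offset = $at + 1;
        // Only accept "@" at the start of the text or directly after whitespace.
        if ($at > 0 && !ctype_space($text[$at - 1])) {
            continue;
        }
        foreach ($usernames as $name) {
            if ($name !== '' && substr($text, $offset, strlen($name)) === $name) {
                $found[] = ['name' => $name, 'start' => $offset, 'length' => strlen($name)];
                // Continue searching after the matched name.
                $offset += strlen($name);
                break;
            }
        }
    }
    return $found;
}
```

The work is roughly the number of @ signs times the number of usernames, with one short string comparison each, so the cost does not grow with every additional character of the line.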
How do Facebook, Twitter, and other sites do this? Do they parse the text while the user is typing and save the mentioned usernames directly in the stored text of the message?
This is my current function:
    $regular_expression_match = '#(?:^|\\s)@(.+?)(?:\n|$)#';
    $matches = false;
    $offset = 0;
    while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
    {
        $line = $matches[1][0];
        $search_string = substr($line, 0, 1);
        $filtered_usernames = array_keys($user_list);
        $matched_username = false;
        // Loop, make the search string one by one char longer and see if we have still usernames matching
        while (count($filtered_usernames) > 1)
        {
            $filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
                $search_string = utf8_clean_string($search_string);
                if (strlen($username_clean) == strlen($search_string))
                {
                    if ($username_clean == $search_string)
                    {
                        $matched_username = $username_clean;
                    }
                    return false;
                }
                return (substr($username_clean, 0, strlen($search_string)) == $search_string);
            });
            if ($search_string == $line)
            {
                // We have reached the end of the line, so stop
                break;
            }
            $search_string = substr($line, 0, strlen($search_string) + 1);
        }
        // If there is still one in filter, we check if it is matching
        $first_username = reset($filtered_usernames);
        if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
        {
            $matched_username = $first_username;
        }
        // We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
        // So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
        // This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
        // and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
        if ($matched_username)
        {
            $startpos = $matches[1][1];
            // We need to get the endpos, cause the username is cleaned and the real string might be longer
            $full_username = substr($post_text, $startpos, strlen($matched_username));
            while (utf8_clean_string($full_username) != $matched_username)
            {
                $full_username = substr($post_text, $startpos, strlen($full_username) + 1);
            }
            $length = strlen($full_username);
            $user_data = $user_list[$matched_username];
            $mentioned[] = array_merge($user_data, array(
                'type' => self::MENTION_AT,
                'start' => $startpos,
                'length' => $length,
            ));
        }
        $offset = $matches[0][1] + strlen($search_string);
    }
Which way would you go? The problem is that the text will be displayed often, and parsing it every time would consume a lot of time, but I don't want to heavily modify what the user entered.
I can't figure out which way is best, or even why my function is so time-consuming.
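One option I am considering for the display problem is to parse only once, when the post is saved, and to store the mention offsets next to the unmodified text. A minimal sketch of the display side, assuming the stored data is a JSON list of byte offsets; the function name `render_mentions`, the `user_id` key, and the `/user/` URL are hypothetical:

```php
<?php
// Hypothetical: $mentions is what the parser produced once at save time.
// 'start' and 'length' are byte offsets into the unmodified post text.
function render_mentions(string $text, array $mentions): string
{
    // Sort by start offset so the text can be walked left to right.
    usort($mentions, function (array $a, array $b): int {
        return $a['start'] <=> $b['start'];
    });

    $html = '';
    $pos = 0;
    foreach ($mentions as $m) {
        // Escape the plain text between the previous mention and this one ("@" included).
        $html .= htmlspecialchars(substr($text, $pos, $m['start'] - $pos));
        $name = substr($text, $m['start'], $m['length']);
        $html .= '<a href="/user/' . (int) $m['user_id'] . '">' . htmlspecialchars($name) . '</a>';
        $pos = $m['start'] + $m['length'];
    }
    // Append whatever follows the last mention.
    return $html . htmlspecialchars(substr($text, $pos));
}

$stored = json_encode([['start' => 4, 'length' => 4, 'user_id' => 7]]);
echo render_mentions('Hi @Test!', json_decode($stored, true));
// prints: Hi @<a href="/user/7">Test</a>!
```

This would leave the text exactly as the user entered it, and displaying a post would cost only a few `substr` calls per mention. The stored offsets would have to be recalculated whenever the post is edited.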
A sample text would be:
Okay, @Firstname Lastname, I mention you! Listen @[TEAM] John, you are a team member. @Test is a normal name, but @Thât♥ should be tracked too. And see @Wolfs garden! I just mean the Wolf.
The usernames in that text would be:
- Firstname Lastname
- [TEAM] John
- Test
- Thât♥
- Wolf
So yes, there is no reliable marker I know of for where a name ends. The only certain boundary is the newline.
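For anyone who wants to test an approach against this sample, here is a small sketch that builds one alternation pattern from all known names (escaped with `preg_quote`) and runs it over the sample text. The function name `match_mentions_regex` is hypothetical, and the sketch assumes exact matching on valid UTF-8 input, without the `utf8_clean_string` normalization from my real code. I have not measured how it behaves with more than 1,000 names.

```php
<?php
// Hypothetical helper: match all known usernames with a single alternation pattern.
function match_mentions_regex(string $text, array $usernames): array
{
    if ($usernames === []) {
        return [];
    }
    // Longest names first: PCRE tries alternatives in order,
    // so a longer name is preferred over a shorter name that is its prefix.
    usort($usernames, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });
    // Escape regex metacharacters such as "[" and "]" in the names.
    $alternatives = array_map(function (string $name): string {
        return preg_quote($name, '#');
    }, $usernames);
    $pattern = '#(?:^|\s)@(' . implode('|', $alternatives) . ')#u';

    // preg_match_all returns 0 for no match and false on error (e.g. invalid UTF-8).
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }
    return $matches[1];
}

$text = 'Okay, @Firstname Lastname, I mention you! Listen @[TEAM] John, you are a team member. '
      . '@Test is a normal name, but @Thât♥ should be tracked too. And see @Wolfs garden! I just mean the Wolf.';
$names = ['Firstname Lastname', '[TEAM] John', 'Test', 'Thât♥', 'Wolf'];

print_r(match_mentions_regex($text, $names));
// Names found, in order of appearance:
// Firstname Lastname, [TEAM] John, Test, Thât♥, Wolf
```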