dongman5539 2015-01-30 12:31
Views: 42

PHP: performant search of a text for given usernames

I am currently dealing with a performance issue that I cannot find a way to fix. I want to search a text for usernames mentioned with an @ sign in front. The list of usernames is available as a PHP array.

The problem is that usernames may contain spaces or other special characters; there is no restriction on them, so I can't find a regex that handles this. Currently I am using a function that takes the whole line after the @ and checks, character by character, which usernames could still match this mention, until only one username is left that fully matches it. But for a long text with 5 mentions it takes several seconds (!!!) to finish. For more than 20 mentions the script runs endlessly.

I have some ideas, but I don't know if they may work.

  1. Go through the username list (could be 1,000 names or more) and search for every @Username without regex, using plain string search. I would guess this is far less efficient.
  2. Check with JavaScript, while the username is being written, whether it contains a space or punctuation mark, and if so surround it with quotation marks, like @"User Name". I don't like that idea; it looks dirty to the user.
  3. Don't start with one character, but maybe with 4, and go back if there is no match. The same principle as in sorting algorithms: divide and conquer. This could be difficult to implement and may lead to nothing.
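
For comparison, here is a minimal sketch of a variant of idea 1. Instead of scanning the whole text once per username, it visits each @ and only tries the usernames that start with the following byte, longest first. The function name `find_mentions` and the result keys are hypothetical. The sketch assumes usernames appear in the text byte-for-byte as stored, so it skips the `utf8_clean_string` normalisation used in my function and only recognises ASCII whitespace before the @.

```php
<?php
/**
 * Find @mentions by trying known usernames at each '@' position.
 * Hypothetical helper; assumes exact, case-sensitive byte matches.
 *
 * @param string   $text      Post text (UTF-8).
 * @param string[] $usernames Known usernames as they appear in the text.
 * @return array[] Items of ['username', 'start', 'length'] (byte offsets).
 */
function find_mentions(string $text, array $usernames): array
{
    // Group usernames by first byte so each '@' checks only a few candidates.
    $index = [];
    foreach ($usernames as $name) {
        if ($name !== '') {
            $index[$name[0]][] = $name;
        }
    }
    // Longest first, so the longest matching username wins.
    foreach ($index as &$bucket) {
        usort($bucket, function ($a, $b) {
            return strlen($b) - strlen($a);
        });
    }
    unset($bucket);

    $mentions = [];
    $pos = 0;
    while (($at = strpos($text, '@', $pos)) !== false) {
        $pos = $at + 1;
        // Like the regex in the question: '@' must start the text or follow whitespace.
        if ($at > 0 && !ctype_space($text[$at - 1])) {
            continue;
        }
        if (!isset($text[$pos])) {
            break; // '@' is the last byte of the text
        }
        foreach ($index[$text[$pos]] ?? [] as $name) {
            if (substr_compare($text, $name, $pos, strlen($name)) === 0) {
                $mentions[] = [
                    'username' => $name,
                    'start'    => $pos,
                    'length'   => strlen($name),
                ];
                $pos += strlen($name);
                break;
            }
        }
    }
    return $mentions;
}
```

The work per mention then depends on how many usernames share a first byte rather than on the size of the whole list, and the index could be built once per request instead of once per post.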

How do Facebook, Twitter and other sites do this? Do they parse the text directly while the user is typing and save the mentioned usernames in the stored text of the message?

This is my current function:

$regular_expression_match = '#(?:^|\\s)@(.+?)(?:\n|$)#';
$matches = false;
$offset = 0;

while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
{
    $line = $matches[1][0];
    $search_string = substr($line, 0, 1);
    $filtered_usernames = array_keys($user_list);
    $matched_username = false;

    // Loop, make the search string one by one char longer and see if we have still usernames matching
    while (count($filtered_usernames) > 1)
    {
        $filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
            $search_string = utf8_clean_string($search_string);

            if (strlen($username_clean) == strlen($search_string))
            {
                if ($username_clean == $search_string)
                {
                    $matched_username = $username_clean;
                }
                return false;
            }

            return (substr($username_clean, 0, strlen($search_string)) == $search_string);
        });

        if ($search_string == $line)
        {
            // We have reached the end of the line, so stop
            break;
        }
        $search_string = substr($line, 0, strlen($search_string) + 1);
    }

    //  If there is still one in filter, we check if it is matching
    $first_username = reset($filtered_usernames);
    if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
    {
        $matched_username = $first_username;
    }

    // We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
    // So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
    // This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
    // and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
    if ($matched_username)
    {
        $startpos = $matches[1][1];

        // We need to get the endpos, cause the username is cleaned and the real string might be longer
        $full_username = substr($post_text, $startpos, strlen($matched_username));
        while (utf8_clean_string($full_username) != $matched_username)
        {
            $full_username = substr($post_text, $startpos, strlen($full_username) + 1);
        }

        $length = strlen($full_username);
        $user_data = $user_list[$matched_username];

        $mentioned[] = array_merge($user_data, array(
            'type'          => self::MENTION_AT,
            'start'         => $startpos,
            'length'        => $length,
        ));
    }

    $offset = $matches[0][1] + strlen($search_string);
}

Which way would you go? The problem is that the text will be displayed often, and parsing it every time will consume a lot of time, but I don't want to heavily modify what the user has entered.

I can't work out what the best way is, or even why my function is so time-consuming.

A sample text would be:

Okay, @Firstname Lastname, I mention you! Listen @[TEAM] John, you are a team member. @Test is a normal name, but @Thât♥ should be tracked too. And see @Wolfs garden! I just mean the Wolf.

Usernames in that text would be

  • Firstname Lastname
  • [TEAM] John
  • Test
  • Thât♥
  • Wolf

So yes, there is nothing I know of that clearly marks where a name ends. The only certain boundary is the newline.

1 answer

  • dongshengheng1013 2015-01-30 13:42

    I think the main problem is that you can't distinguish usernames from ordinary text, and it's a bad idea to look up maybe thousands of usernames in a text. This can also lead to further problems, for example that John is part of [TEAM] John or JohnFoo...

    The usernames need to be separated from the other text. Assuming that you're using UTF-8, you could put each username between two invisible characters: a zero-width space (\xE2\x80\x8B) and a zero-width non-joiner (\xE2\x80\x8C).

    The usernames can then be extracted quickly and with little effort and, if needed, still be verified against the database.

    $txt = "
    Okay, @\xE2\x80\x8BFirstname Lastname\xE2\x80\x8C, I mention you!
    Listen @\xE2\x80\x8B[TEAM] John\xE2\x80\x8C, you are a team member.
    @\xE2\x80\x8BTest\xE2\x80\x8C is a normal name, but 
    @\xE2\x80\x8BThât♥\xE2\x80\x8C should be tracked too.
    And see @\xE2\x80\x8BWolfs\xE2\x80\x8C garden! I just mean the Wolf.";
    
    // extract usernames
    if(preg_match_all('~@\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $txt, $out)){
      print_r($out[0]);
    }
    

    Array ( [0] => Firstname Lastname [1] => [TEAM] John [2] => Test [3] => Thât♥ [4] => Wolfs )

    echo $txt;

    Okay, @​Firstname Lastname, I mention you!
    Listen @​[TEAM] John‌, you are a team member.
    @​Test‌ is a normal name, but 
    @​Thât♥‌ should be tracked too.
    And see @​Wolfs‌ garden! I just mean the Wolf.
    

    You could use any characters you like for the separation, as long as they are unlikely to occur elsewhere in the text.
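
As a rough sketch of how the delimiter approach could be wired up: one helper builds the stored form of a mention (for example when an autocomplete inserts a name), and another extracts the delimited names with their byte offsets and keeps only those that exist. The names `wrap_mention`, `extract_mentions`, `MENTION_OPEN` and `MENTION_CLOSE` are hypothetical, and `$user_list` is assumed to be an array keyed by the exact username, standing in for the database check.

```php
<?php
// Delimiters from the answer: zero-width space (U+200B) and
// zero-width non-joiner (U+200C), written as UTF-8 bytes.
const MENTION_OPEN  = "\xE2\x80\x8B";
const MENTION_CLOSE = "\xE2\x80\x8C";

/** Build the stored form of a mention for the given username. */
function wrap_mention(string $username): string
{
    return '@' . MENTION_OPEN . $username . MENTION_CLOSE;
}

/**
 * Extract delimited mentions and keep only names found in $user_list.
 * Assumption: $user_list maps exact username => user data.
 *
 * @return array[] Items of ['username', 'start', 'length'] (byte offsets).
 */
function extract_mentions(string $text, array $user_list): array
{
    // Same idea as the pattern above; \K drops '@' and the opening
    // delimiter from the reported match.
    $pattern = '~@' . MENTION_OPEN . '\K.*?(?=' . MENTION_CLOSE . ')~s';

    $mentions = [];
    if (preg_match_all($pattern, $text, $out, PREG_OFFSET_CAPTURE)) {
        foreach ($out[0] as [$name, $start]) {
            // Drop delimited names that are not real users, e.g. "Wolfs".
            if (isset($user_list[$name])) {
                $mentions[] = [
                    'username' => $name,
                    'start'    => $start,
                    'length'   => strlen($name),
                ];
            }
        }
    }
    return $mentions;
}
```

Because the delimiters are invisible, the stored text can be echoed as it is, and the lookup is a single array or database check per extracted name rather than a search over all usernames.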

    Regex FAQ, Test at eval.in (link will expire soon)
