douzhaxian1267 2012-07-18 07:13

Discard all characters before and after the search term, except the nearest 10 words

I'm trying to finish the search function for a site I'm developing. Because the search results show only excerpts of each matched item, I want to highlight the search terms in the results and display only the portions of text that actually contain them.

My plan is to fetch the whole content from the database and use preg_replace to wrap the search terms in <span> elements while keeping only the 10 words before and after each term. This is the regex part of it:

(?:.*?)((?:\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})

The idea is to "discard" the leading text with a non-capturing subpattern, then capture up to 10 words before the term, the term itself, and up to 10 words after it.

This is the replacement text in preg_replace:

\\1<span class="search-term search-term-content">\\2</span>\\3...
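
For reference, here is a minimal sketch of how the pattern and replacement above might be assembled into one call. The sample $content and the use of preg_quote() are assumptions added for illustration; the question does not show this part of the code.

```php
<?php
// Hypothetical inputs; in the question they come from the search form and the database.
$terms   = ['new', 'post'];
$content = 'I wrote a new post today about regex.';

// Assumption: escape each term so that characters such as "." or "@"
// are not read as regex syntax or as the pattern delimiter.
$alternation = implode('|', array_map(function ($term) {
    return preg_quote($term, '@');
}, $terms));

// The pattern and replacement exactly as given in the question.
$pattern     = '@(?:.*?)((?:\w+\W+){0,10})(' . $alternation . ')((?:\W*\w+\W+){0,10})@i';
$replacement = '\\1<span class="search-term search-term-content">\\2</span>\\3...';

$excerpt = preg_replace($pattern, $replacement, $content);
echo $excerpt, PHP_EOL;
```

Note that the greedy first group swallows as many words as it can, so when two terms sit close together only the later one ends up inside the <span>.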

The search itself runs through MySQL's MATCH()...AGAINST() on a MyISAM FULLTEXT index that spans multiple columns. The regex above, however, is applied to only one of those columns (let's call it content).

The problem: whenever a row matches on other columns but not on the content column, the regex above strips all text from the content column. I believe that is because of the (?:.*?) subpattern at the very beginning, which keeps matching without ever finding the subpatterns that follow it.

Is there another way to achieve the original purpose of the regex without this side effect? I am currently thinking of using preg_match_all to match just the search term and the 10 words before and after it, then iterating over the matches and building the preview text manually. That would work, but given my inexperience with regex I would also like to know whether a single pattern can do it.
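
The preg_match_all idea could be sketched roughly as follows. buildExcerpts() is a hypothetical helper, not code from the question, and it assumes that every term starts and ends with a word character (otherwise the \b checks would not behave as intended) and that the text is plain, not HTML.

```php
<?php
/**
 * Hypothetical helper: returns one highlighted excerpt for each window of
 * text around a search term, or an empty array when no term is found.
 */
function buildExcerpts(string $text, array $terms, int $context = 10): array
{
    $alternation = implode('|', array_map(function ($term) {
        return preg_quote($term, '/');
    }, $terms));

    // Up to $context words before a term, the term, up to $context words after it.
    // There is no leading .*? here: the regex engine itself advances to the
    // first position where a window can start.
    $window = '/\b(?:\w+\W+){0,' . $context . '}\b(?:' . $alternation . ')\b'
            . '(?:\W+\w+){0,' . $context . '}/i';

    if (!preg_match_all($window, $text, $matches)) {
        return []; // no term in this column (or a PCRE error): nothing to excerpt
    }

    // Second pass: wrap every term inside each window, not only the one
    // that anchored the window.
    $highlight = '/\b(' . $alternation . ')\b/i';
    $excerpts  = [];
    foreach ($matches[0] as $snippet) {
        $excerpts[] = preg_replace(
            $highlight,
            '<span class="search-term search-term-content">$1</span>',
            $snippet
        );
    }
    return $excerpts;
}

echo implode(' ... ', buildExcerpts('alpha beta new post gamma', ['new', 'post'])), PHP_EOL;
```

Because a column without any term simply yields an empty array, the caller can decide what to show instead (for example the first few words of the content) rather than ending up with a blank preview.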

UPDATE

I just noticed that I only get blank content when I enter two or more search terms. With a single term it works perfectly. I have no idea why this is happening.

UPDATE 2

Echoing preg_last_error() gives PREG_BACKTRACK_LIMIT_ERROR. The search terms I used are new and post.
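
This error also explains the blank content: preg_replace() returns null when PCRE aborts, and echoing null prints nothing. A minimal sketch of a guard, assuming a hypothetical wrapper named highlightOrFallback() that is not part of the original code:

```php
<?php
/**
 * Hypothetical wrapper: returns the highlighted text, or the untouched
 * text when PCRE gives up instead of producing a result.
 */
function highlightOrFallback(string $pattern, string $replacement, string $text): string
{
    $result = preg_replace($pattern, $replacement, $text);

    if ($result === null) {
        // preg_replace() returns null on failure; preg_last_error() says why.
        if (preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
            error_log('pcre.backtrack_limit reached for pattern ' . $pattern);
        }
        return $text; // show the plain content rather than nothing
    }

    return $result;
}

echo highlightOrFallback('@(new|post)@i', '<b>\\1</b>', 'A new day'), PHP_EOL;
```

The guard only hides the symptom. The pattern still needs to be fixed so that it stops backtracking excessively.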

A var_dump of the regex and the terms shows this:

@(?:.*?)((?:\w+\W+){0,10})(new|post)((?:\W*\w+\W+){0,10})@i

array
  0 => string 'new' (length=3)
  1 => string 'post' (length=4)

UPDATE 3

I used Regex Coach to step through the matching. It seems that the engine backtracks too much when it finds no match for (new|post). The target text is simply three paragraphs of random lorem ipsum. I think I need to find a better regex for this task.

UPDATE 4

Using a once-only subpattern solves the problem. I don't understand the details yet, but I re-read the PHP Manual and found that once-only subpatterns help against excessive backtracking. This is the new regex:

(?:.*?)((?>\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})

I'm still open to suggestions for better regexes. Thanks!

1 answer

  • dqyhj2014 2012-07-18 09:10

    If you're having issues with hitting the backtracking limit, you generally want to look at once-only subpatterns.

    In this case, however, your main issue seems to be that (?:.*?) is followed by (?:\w+\W+){0,10}. Take the string 'hello world!' as an example and ignore the {0,10} for now. The two subpatterns can split it in all of the following ways:

    • '' and 'hello '
    • 'h' and 'ello '
    • 'he' and 'llo '
    • 'hel' and 'lo '
    • 'hell' and 'o '
    • 'hello ' and 'world!'
    • 'hello w' and 'orld!'
    • 'hello wo' and 'rld!'
    • 'hello wor' and 'ld!'
    • 'hello worl' and 'd!'

    The easiest way to block this redundant backtracking is to add a word boundary check (\b) after the (?:.*?) subpattern. This reduces the potential matches to:

    • '' and 'hello '
    • 'hello ' and 'world!'
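
Applied to the pattern from the question, the suggestion might look like the sketch below. The sample text is hypothetical and deliberately contains neither term, which is the case that previously exhausted the backtrack limit on longer input. The preg_quote() call is an added assumption as well.

```php
<?php
$terms = ['new', 'post']; // the terms used in the question

// Assumption: escape the terms before joining them into an alternation.
$alternation = implode('|', array_map(function ($term) {
    return preg_quote($term, '@');
}, $terms));

// Same pattern as in the question, with \b added right after the lazy prefix,
// so the word-capturing group can only start at the beginning of a word.
$pattern     = '@(?:.*?)\b((?:\w+\W+){0,10})(' . $alternation . ')((?:\W*\w+\W+){0,10})@i';
$replacement = '\\1<span class="search-term search-term-content">\\2</span>\\3...';

// Hypothetical sample text without either term.
$content = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.';
$result  = preg_replace($pattern, $replacement, $content);

var_dump($result === $content);                 // text is left untouched when nothing matches
var_dump(preg_last_error() === PREG_NO_ERROR);  // no backtrack-limit error on this input
```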

    EDIT: Here is an example of why a once-only subpattern will not work here:

    preg_replace('/(?>[a-z]{0,2})a/','x','bac')
    

    In this example we would expect the result 'xc'. However, the subpattern greedily matches 'ba' and then never backtracks, so the match is missed. We could make the pattern ungreedy, but then we would get the result 'bxc', because the subpattern matches '' and never backtracks to try anything longer.
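
The three variants described above can be compared side by side. The plain (?:...) version is added here only as a baseline and is not part of the original answer.

```php
<?php
// Plain group: the engine backtracks from 'ba' to 'b', so 'ba' matches.
$plain = preg_replace('/(?:[a-z]{0,2})a/', 'x', 'bac');

// Once-only (atomic) group, greedy: it grabs 'ba', may not give the 'a' back,
// and no other start position works either, so nothing is replaced.
$atomicGreedy = preg_replace('/(?>[a-z]{0,2})a/', 'x', 'bac');

// Once-only group, ungreedy: it settles for '' and may not grow,
// so only the lone 'a' matches.
$atomicLazy = preg_replace('/(?>[a-z]{0,2}?)a/', 'x', 'bac');

var_dump($plain);        // string(2) "xc"
var_dump($atomicGreedy); // string(3) "bac"
var_dump($atomicLazy);   // string(3) "bxc"
```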

    This answer was accepted by the asker as the best answer.
