I'm trying to finish the search function in one of the sites I'm developing. Since my search results only display excerpts of the contents of matched items, what I want to do is to highlight search terms within the search results and display only portions of texts that actually contain those search terms.
What I figured I'd do is fetch the whole content from the database and use preg_replace to insert <span> elements around the search terms and, at the same time, extract only the 10 words before and after each term. So this is the regex part of it:
(?:.*?)((?:\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})
Basically, I try to "discard" all the text leading up to the 10 words before the search term by using a non-capturing subpattern, then capture the 10 words before the term, then the term itself, then the next 10 words.
This is the replacement text in preg_replace:
\\1<span class="search-term search-term-content">\\2</span>\\3...
The search is done with MySQL's MATCH()...AGAINST() on MyISAM FULLTEXT indexes spanning multiple columns. However, the above regex is only applied to one of those columns (let's call that column content).
So my problem is that whenever I get a match on other columns but not on the content column, the regex above strips all text from the content column. That's because the (?:.*?) subpattern at the very beginning keeps matching without ever finding the subpatterns that follow it.
I was wondering if there is any other way to implement the original purpose of the regex without this side effect. I am currently thinking of simply using preg_match_all to match just the search term and the 10 words before and after it, then iterating over all of the matches and building the preview text manually. That is a sound solution, but given my inexperience with regex, I thought I might as well try to find a regex-based solution to this.
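For reference, the preg_match_all idea described above could be sketched roughly as follows. This is only a sketch under assumptions: the function name build_preview is hypothetical, and the \b word boundaries, the preg_quote() escaping, the htmlspecialchars() calls and the "return the text unchanged when nothing matches" fallback are my additions, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original code): builds a highlighted
// preview out of every term occurrence plus up to $words words on each side.
function build_preview(string $content, array $terms, int $words = 10): string
{
    // Escape the terms so regex metacharacters in user input are literal.
    $alternation = implode('|', array_map(function ($term) {
        return preg_quote($term, '@');
    }, $terms));

    // No leading (?:.*?): preg_match_all scans for the occurrences itself.
    $pattern = '@\b((?:\w+\W+){0,' . $words . '})'
             . '\b(' . $alternation . ')\b'
             . '((?:\W+\w+){0,' . $words . '})@i';

    if (!preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
        // No occurrence (or a PCRE error): keep the text instead of blanking it.
        return htmlspecialchars($content);
    }

    $pieces = [];
    foreach ($matches as $m) {
        // Assumption: the content is plain text, so each part is HTML-escaped.
        $pieces[] = htmlspecialchars($m[1])
                  . '<span class="search-term search-term-content">'
                  . htmlspecialchars($m[2]) . '</span>'
                  . htmlspecialchars($m[3]) . '...';
    }
    return implode(' ', $pieces);
}
```

Because the match is searched for instead of being anchored behind a lazy .*?, a content column without any of the terms simply falls through to the fallback branch and keeps its text.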
UPDATE
I just noticed that I only get blank contents when I put 2 or more search terms. Other than that, it works perfectly. I now have no idea why this is happening.
UPDATE 2
Echoing preg_last_error(), I get this error: PREG_BACKTRACK_LIMIT_ERROR. I use the words new and post for the search terms.
A var_dump of the regex and the terms shows this:
@(?:.*?)((?:\w+\W+){0,10})(new|post)((?:\W*\w+\W+){0,10})@i
array
0 => string 'new' (length=3)
1 => string 'post' (length=4)
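The blank content fits how preg_replace reports failure: it returns NULL when a PCRE error such as the backtrack limit is hit. A minimal sketch of guarding against that follows; the wrapper name safe_preg_replace and its by-reference $error parameter are hypothetical, only preg_replace() and preg_last_error() are real PHP functions.

```php
<?php
// Hypothetical wrapper: returns the original subject instead of NULL when
// PCRE fails, and reports the error code through $error.
function safe_preg_replace(
    string $pattern,
    string $replacement,
    string $subject,
    ?int &$error = null
): string {
    $result = preg_replace($pattern, $replacement, $subject);

    // PREG_NO_ERROR on success, e.g. PREG_BACKTRACK_LIMIT_ERROR on failure.
    $error = preg_last_error();

    if ($result === null) {
        // Keep the untouched text so the preview is never blanked out.
        return $subject;
    }
    return $result;
}
```

Checking $error after the call makes it possible to tell "no term in this column" (text unchanged, no error) apart from "the regex gave up" (text unchanged, error code set).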
UPDATE 3
I used Regex Coach to walk me through the matching process. It seems that the regex backtracks too much after it finds no match for (new|post). The target text is simply a random 3-paragraph lorem ipsum. I think I need to find a better regex for this task.
UPDATE 4
Using a Once-Only subpattern solves the problem. I don't understand its details yet, but I re-read the PHP Manual and found a part saying that Once-Only subpatterns help with excessive backtracking. This is the new regex:
(?:.*?)((?>\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})
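As far as I understand it, a once-only (atomic) group (?>...) keeps whatever it matched and never gives characters back when a later part of the pattern fails. A tiny illustration, using a made-up subject string rather than my real content:

```php
<?php
// Plain group: a+ first takes "aaa", then gives one "a" back so "ab" can match.
$plain = preg_match('/(?:a+)ab/', 'aaab');

// Atomic group: a+ keeps every "a" it took, so "ab" can never match here.
$atomic = preg_match('/(?>a+)ab/', 'aaab');

var_dump($plain, $atomic); // int(1) and int(0)
```

In my regex the same thing happens to (?>\w+\W+): once a word and its trailing separators are consumed, the engine does not retry shorter versions of that word, which removes a lot of the backtracking.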
But I'm still open to suggestions for better regexes. Thanks!