Discard all characters before and after the search term, except the first 10 words

I'm trying to finish the search function for one of the sites I'm developing. Since my search results display only excerpts of the matched items' contents, I want to highlight the search terms within the results and show only the portions of text that actually contain those terms.

What I figured I'd do is to fetch the whole content from the database and use preg_replace to insert <span> elements around the search terms and at the same time extract only the first 10 words before and after the term. So this is the regex part of it:

(?:.*?)((?:\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})

Basically, I try to "discard" all text except the first 10 words before the search term by using a non-capturing subpattern, then get the 10 words before the term, then the term itself, then the next 10 words.
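One detail worth noting about the pattern above: the terms are spliced into the regex as-is, so a term containing a metacharacter (`.`, `+`, `?` and so on) would change what the pattern means. A minimal sketch of escaping each term with preg_quote first; the helper name build_alternation and the sample terms are hypothetical, not part of the original post:

```php
<?php
// Hypothetical helper (not from the original post): escape every search term
// so it is matched literally, then join the terms into one alternation.
function build_alternation(array $terms): string
{
    $quoted = array_map(function ($term) {
        // '@' is passed because the post uses it as the pattern delimiter.
        return preg_quote($term, '@');
    }, $terms);

    return implode('|', $quoted);
}

// Sample terms are made up for illustration.
$alternation = build_alternation(['c++', 'new']);

// Same pattern shape as in the post, with the escaped alternation in the middle.
$pattern = '@(?:.*?)((?:\w+\W+){0,10})(' . $alternation . ')((?:\W*\w+\W+){0,10})@i';
```

Without the escaping, a term such as `c++` would be read as a quantified `c` rather than as the literal text.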

This is the replacement text in preg_replace:

\\1<span class="search-term search-term-content">\\2</span>\\3...

The search itself is performed with MySQL's MATCH()...AGAINST() on a MyISAM FULLTEXT index that spans multiple columns. However, the above regex is applied to only one of those columns (let's call this column content).

So my problem is whenever I get a match on other columns but not on the content column, the regex above strips all text from the content column. That's because of the (?:.*?) subpattern at the very beginning which continues to match without ever finding the next subpatterns.

I was wondering whether there is another way to achieve the original purpose of the regex without this side effect. I am currently thinking of simply using preg_match_all to match the search term together with the 10 words before and after it, then iterating over all of the matches and building the preview text manually. That would be a sound solution, but given my inexperience with regex, I thought I might as well try to find a regex-only solution.
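The preg_match_all idea could be sketched roughly as follows. This is only one possible shape, under stated assumptions: the function name build_preview, its parameters, and the choice to return null when the column contains no term are all hypothetical. It extracts a window of words around each hit first and highlights the terms inside each window in a second, simpler pass.

```php
<?php
// Hypothetical helper (name, signature and null fallback are assumptions).
function build_preview(string $content, array $terms, int $words = 10): ?string
{
    if (!$terms) {
        return null;
    }

    $alt = implode('|', array_map(function ($t) {
        return preg_quote($t, '@');
    }, $terms));

    // Up to $words words before the term, the rest of the term's own word,
    // then up to $words words after it. No leading (?:.*?) is needed because
    // preg_match_all already scans the whole subject.
    $window = '@\b(?:\w+\W+){0,' . $words . '}(?:' . $alt . ')\w*(?:\W+\w+){0,' . $words . '}@i';

    $count = preg_match_all($window, $content, $matches);
    if (!$count) {
        return null; // no term in this column, or a PCRE error: caller picks a fallback
    }

    $snippets = [];
    foreach ($matches[0] as $snippet) {
        // Highlight every term that falls inside this window.
        $snippets[] = preg_replace(
            '@\b(' . $alt . ')@i',
            '<span class="search-term search-term-content">$1</span>',
            $snippet
        );
    }

    return implode('...', $snippets) . '...';
}

// Usage with made-up data: a window of 2 words on each side.
echo build_preview('one two three new four five', ['new'], 2), PHP_EOL;
```

Returning null rather than an empty string lets the caller keep the original content when the match came from another column.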

UPDATE

I just noticed that I only get blank contents when I put 2 or more search terms. Other than that, it works perfectly. I now have no idea why this is happening.

UPDATE 2

Echoing preg_last_error() gives PREG_BACKTRACK_LIMIT_ERROR. The search terms I used are new and post.

A var_dump of the regex and the terms show this:

@(?:.*?)((?:\w+\W+){0,10})(new|post)((?:\W*\w+\W+){0,10})@i

array
  0 => string 'new' (length=3)
  1 => string 'post' (length=4)
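For context, preg_replace returns NULL when PCRE hits an error such as the backtrack limit, which would explain the blank content. A minimal sketch of a wrapper that falls back to the untouched text in that case; the function name safe_preg_replace and its by-reference error parameter are hypothetical:

```php
<?php
// Hypothetical wrapper (not from the original post): return the subject
// unchanged when preg_replace fails, and report the PCRE error code.
function safe_preg_replace(string $pattern, string $replacement, string $subject, &$errorCode = null): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    $errorCode = preg_last_error();

    if ($result === null) {
        // For example PREG_BACKTRACK_LIMIT_ERROR: keep the original text
        // instead of showing an empty preview.
        return $subject;
    }

    return $result;
}

// Usage: invalid UTF-8 with the /u modifier is an easy way to trigger a failure.
$text = safe_preg_replace('/./u', 'x', "\xff", $code);
var_dump($text === "\xff", $code === PREG_BAD_UTF8_ERROR);
```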

UPDATE 3

I used Regex Coach to walk through the matching. It seems that the pattern backtracks too much when it finds no match for (new|post). The target text is simply a random 3-paragraph lorem ipsum. I think I need to find a better regex for this task.

UPDATE 4

Using a once-only subpattern solves the problem. I don't know the details, but I re-read the PHP manual and found a passage saying that once-only subpatterns help with excessive backtracking. This is the new regex:

(?:.*?)((?>\w+\W+){0,10})('.implode('|', $terms).')((?:\W*\w+\W+){0,10})

But I'm still open to suggestions for better regexes. Thanks!

Answer by dqyhj2014, 2012-07-18 09:10
If you're having issues with hitting the backtracking limit, you generally want to look at once-only subpatterns.

In this case however your main issue seems to be the (?:.*?) being followed by (?:\w+\W+){0,10}. Take for example the string 'hello world!', ignoring for now the {0,10}. This will match the two patterns as all of the following:

'' and 'hello '

'h' and 'ello '

'he' and 'llo '

'hel' and 'lo '

'hell' and 'o '

'hello ' and 'world!'

'hello w' and 'orld!'

'hello wo' and 'rld!'

'hello wor' and 'ld!'

'hello worl' and 'd!'

The easiest way to block this redundant backtracking is to add a word boundary check (\b) after the (?:.*?) subpattern. This will reduce these potential matches to

'' and 'hello '

'hello ' and 'world!'
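Applied to the pattern from the question, the word boundary would sit right after the leading (?:.*?). The following is a sketch with made-up sample text and terms; it shows where the \b goes and what the replacement yields on a short string, and does not by itself demonstrate that a long text stays under pcre.backtrack_limit.

```php
<?php
// Sample values are made up for illustration; only the pattern shape comes from the thread.
$terms   = ['new', 'post'];
$content = 'a b c d e f g h i j k l new m.';

$alternation = implode('|', array_map(function ($t) {
    return preg_quote($t, '@');
}, $terms));

// Same pattern as in the question, with \b inserted after the leading (?:.*?)
// so the ten-word group is only attempted at word boundaries.
$pattern = '@(?:.*?)\b((?:\w+\W+){0,10})(' . $alternation . ')((?:\W*\w+\W+){0,10})@i';
$replacement = '\\1<span class="search-term search-term-content">\\2</span>\\3...';

$result = preg_replace($pattern, $replacement, $content);
if ($result === null) {
    $result = $content; // PCRE error: keep the original text
}

echo $result, PHP_EOL;
```

Here the first two words fall outside the ten-word window and are discarded, so the output starts at `c`.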

EDIT: Here is an example of why a once-only subpattern will not work here:

preg_replace('/(?>[a-z]{0,2})a/','x','bac')

In this example we would expect the result 'xc'. However, the subpattern greedily matches 'ba' and then never backtracks, so the match is missed and the string comes back unchanged as 'bac'. We could make the pattern ungreedy, but then we would get the result 'bxc', because it never backtracks after matching '' for the subpattern.
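The three cases described above can be checked side by side. A small sketch; the variable names are illustrative only:

```php
<?php
$subject = 'bac';

// Atomic + greedy: the group grabs 'ba' and cannot give the 'a' back,
// so no match is found and the subject is returned unchanged.
$atomicGreedy = preg_replace('/(?>[a-z]{0,2})a/', 'x', $subject);

// Ordinary (non-atomic) quantifier: backtracking shrinks it to 'b',
// so 'ba' is matched and replaced.
$plain = preg_replace('/[a-z]{0,2}a/', 'x', $subject);

// Atomic + ungreedy: the group settles on '' and cannot grow afterwards,
// so only the 'a' itself is matched and replaced.
$atomicLazy = preg_replace('/(?>[a-z]{0,2}?)a/', 'x', $subject);

var_dump($atomicGreedy, $plain, $atomicLazy);
// string(3) "bac"
// string(2) "xc"
// string(3) "bxc"
```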
This answer was selected by the asker as the best answer.

Question tagged php, asked 2012-07-18 07:13.
