Regular expression PREG_BACKTRACK_LIMIT_ERROR when non-greedily extracting really long text

I have a string of the form:

Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text

I want to extract Really Really Long Text... from the string with a regular expression, that is, everything up to the first [Closing].

If I do a regular expression like this:

$pMatch = "'\[Opening\](.+)\[Closing\]'si";

That gives me:

Really Really Long Text...[Closing]More Text

I can also make it not greedy like this:

$pMatch = "'\[Opening\](.+?)\[Closing\]'si";

Which works and gives me the correct output:

Really Really Long Text...

However, if I replace "Really Really Long Text..." with actual really really long text, it doesn't work and instead I receive a PREG_BACKTRACK_LIMIT_ERROR. I don't get an error if I use the greedy regular expression. I just get the wrong output as in the first case.

I've been working with regular expressions for a while, but this one has me stumped. Is there a way to get this to work with a regular expression or is regular expression not suitable for this task?

Here is PHP code to reproduce the issue:

<?php

  $sShortString = "Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text";
  $sLongString = "Some Text[Opening]".str_repeat("BLAH", 1000000)."[Closing]More Text[Closing]Even More Text";

  $pGreedyMatch = "'\[Opening\](.+)\[Closing\]'si";
  $pNonGreedyMatch = "'\[Opening\](.+?)\[Closing\]'si";

  header("Content-Type: text/plain");

  if (preg_match($pGreedyMatch, $sShortString, $aMatch)) {
    echo "Greedy Match:\n";
    print_r($aMatch);
  }

  if (preg_match($pNonGreedyMatch, $sShortString, $aMatch)) {
    echo "Non-Greedy Match:\n";
    print_r($aMatch);
  }

  if (preg_match($pGreedyMatch, $sLongString, $aMatch)) {
    echo "Greedy Match:\n";
    echo "Length: ".strlen($aMatch[1])."\n";
  }

  if (preg_match($pNonGreedyMatch, $sLongString, $aMatch)) {
    echo "Non-Greedy Match:\n";
    echo strlen($aMatch[1]);
  } else {
    echo "Non-Greedy Doesn't Match!\n";
  }

  $iLastError = preg_last_error();
  if ($iLastError == PREG_BACKTRACK_LIMIT_ERROR) {
    echo "It's because the backtrack limit was exceeded!\n";
  }

?>

I get the output:

Greedy Match:
Array
(
    [0] => [Opening]Really Really Long Text...[Closing]More Text[Closing]
    [1] => Really Really Long Text...[Closing]More Text
)
Non-Greedy Match:
Array
(
    [0] => [Opening]Really Really Long Text...[Closing]
    [1] => Really Really Long Text...
)
Greedy Match:
Length: 4000018
Non-Greedy Doesn't Match!
It's because the backtrack limit was exceeded!

I've got it working by using the greedy regular expression and using additional code to strip off the text from [Closing] onward. I would like to better understand what's happening behind the scenes, why it needs to do so much backtracking, and if there's a way that the regular expression can be modified so it performs the task.
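For reference, the workaround could look roughly like the sketch below. This is a simplified reconstruction rather than the exact code in use: the helper name extractUpToFirstClosing is made up for illustration, and "~" is used as the pattern delimiter instead of "'".

```php
<?php
/**
 * Hypothetical helper: run the greedy pattern, then cut the captured
 * text at the first "[Closing]" it contains.
 */
function extractUpToFirstClosing(string $subject): ?string
{
    // The greedy match captures everything up to the LAST [Closing].
    if (!preg_match('~\[Opening\](.+)\[Closing\]~si', $subject, $m)) {
        return null;
    }
    $captured = $m[1];
    // stripos mirrors the pattern's case-insensitive "i" flag.
    $pos = stripos($captured, '[Closing]');
    return $pos === false ? $captured : substr($captured, 0, $pos);
}

$sShort = "Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text";
echo extractUpToFirstClosing($sShort), "\n";
```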

I really appreciate any insight!

1 answer

duanqiu2064 2018-05-26 23:31
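A non-greedy quantifier has a cost: after each character it consumes, the engine has to check whether the rest of the pattern matches at that point.

In the pattern above, each time the . in (.+?) matches a character, the engine tests whether the following characters match [Closing]. Each time that test fails, it has to backtrack and extend the match by one more character. With about four million characters before the first [Closing], this happens millions of times, which is why the backtrack limit (the pcre.backtrack_limit setting) is used up. The greedy version avoids this because .+ first runs to the end of the string and then only has to back up a short distance to reach the last [Closing].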
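The effect can be reproduced on a much smaller input by lowering the limit. The following is a minimal sketch under stated assumptions, not a recommended configuration: it turns off the PCRE JIT and sets pcre.backtrack_limit to 1000 purely for demonstration, the helper describeMatch is hypothetical, and "~" replaces "'" as the delimiter.

```php
<?php
// Demo-only assumptions: JIT off and a very low limit, so that a small
// subject is enough to exhaust the backtrack budget.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

/**
 * Hypothetical helper: run preg_match and describe the outcome.
 */
function describeMatch(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $m);
    if ($result === 1) {
        return 'matched, group 1 length ' . strlen($m[1]);
    }
    if (preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
        return 'backtrack limit exceeded';
    }
    return 'no match';
}

// 20,000 characters between the tags, far more than the lowered limit.
$subject = 'Some Text[Opening]' . str_repeat('BLAH', 5000)
         . '[Closing]More Text[Closing]Even More Text';

echo 'Greedy:     ', describeMatch('~\[Opening\](.+)\[Closing\]~si', $subject), "\n";
echo 'Non-greedy: ', describeMatch('~\[Opening\](.+?)\[Closing\]~si', $subject), "\n";
```

With these settings the greedy pattern should still match, while the non-greedy one should report the limit error. The exact behaviour can differ between PHP and PCRE versions, particularly when the JIT is enabled.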

The pattern can be rewritten like this:

'\[Opening\]([^\[]*(?:\[(?!Closing\])[^\[]*)*)(*SKIP)\[Closing\]'si

Let's examine this pattern piece by piece to understand it.

1) We open with \[Opening\]. This pattern matches the opening tag.

2) As our pattern isn't repeating within itself, the (*SKIP) verb placed after the capture group () is used as a further optimization. It means that if the rest of the pattern fails to match, the search restarts from the position the engine had already reached. The default behaviour would be to start searching again at the character after the previous starting point.

To better understand this, imagine that our string is "sometimes we get [Close to matching". When we get to the [, we scan [Clos before we conclude that this isn't the pattern we want. Normally, we would backtrack and start looking again at Close. However, (*SKIP) allows us to continue searching at e to matching.

3) Inside the capture group we start with [^\[]*, which matches as many characters as possible that are not [. Here ^ means "not", \[ is a literal [, and the surrounding [] makes it a character class. * lets it repeat as many times as possible.

4) Next we have (?:...)*. The parentheses group a sub-pattern, ?: indicates that the group is not captured, and * lets it repeat as many times as we like (including not at all).

5) The first character in that group is \[, that is, a literal [ such as the one that would begin our closing tag.

6) Next, we have (?!Closing\]). (?!...) is a negative lookahead. A lookahead means the engine looks at the next characters and either matches or fails to match without consuming them. This lets us accept the [ only as long as it is not followed by Closing].

7) We have another [^\[]*, which lets us keep consuming characters once the lookahead has confirmed that this [ does not start the closing tag. This is how we get past something like [Clos.

8) Finally, our regular expression ends with \[Closing\].
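Putting the pieces together, the pattern can be tried against the long string from the question. This is a minimal sketch assuming default PCRE settings. It uses the (?!Closing\]) form of the lookahead described in step 6 and "~" as the delimiter, and the $sTricky subject is an invented example with stray [ characters inside the captured text.

```php
<?php
// Rewritten pattern: "anything but [" runs, joined by "[" characters
// that are not the start of "[Closing]".
$pUnrolled = '~\[Opening\]([^\[]*(?:\[(?!Closing\])[^\[]*)*)(*SKIP)\[Closing\]~si';

$sLongString = 'Some Text[Opening]' . str_repeat('BLAH', 1000000)
             . '[Closing]More Text[Closing]Even More Text';

// Invented subject: the captured text contains "[" that is not a closing tag.
$sTricky = 'x[Opening]a[b]c[Clos[Closing]More Text[Closing]';

if (preg_match($pUnrolled, $sLongString, $aMatch)) {
    echo "Long string, captured length: " . strlen($aMatch[1]) . "\n";
} elseif (preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit exceeded\n";
}

if (preg_match($pUnrolled, $sTricky, $aTricky)) {
    echo "Tricky string, captured: " . $aTricky[1] . "\n";
}
```

Because [^\[]* consumes each run of ordinary characters in a single step, the engine should not need a separate backtracking step for every character of the long text, so the match stays well within the limit.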