dqwh2717 2018-05-26 06:16
浏览 132

正则表达式PREG_BACKTRACK_LIMIT_ERROR在提取真正长文本非贪婪时

I have a string of the form:

Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text

I want to extract Really Really Long Text... from the string with a regular expression. Up until the first [Closing].

If I do a regular expression like this:

$pMatch = "'\[Opening\](.+)\[Closing\]'si";

That gives me:

Really Really Long Text...[Closing]More Text

I can also make it not greedy like this:

$pMatch = "'\[Opening\](.+?)\[Closing\]'si";

Which works and gives me the correct output:

Really Really Long Text...

However, if I replace "Really Really Long Text..." with actual really really long text, it doesn't work and instead I receive a PREG_BACKTRACK_LIMIT_ERROR. I don't get an error if I use the greedy regular expression. I just get the wrong output as in the first case.

I've been working with regular expressions for a while, but this one has me stumped. Is there a way to get this to work with a regular expression or is regular expression not suitable for this task?

Here is PHP code to reproduce the issue:

<?php

  $sShortString = "Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text";
  $sLongString = "Some Text[Opening]".str_repeat("BLAH", 1000000)."[Closing]More Text[Closing]Even More Text";

  $pGreedyMatch = "'\[Opening\](.+)\[Closing\]'si";
  $pNonGreedyMatch = "'\[Opening\](.+?)\[Closing\]'si";

  header("Content-Type: text/plain");

  if (preg_match($pGreedyMatch, $sShortString, $aMatch)) {
    echo "Greedy Match:
";
    print_r($aMatch);
  }

  if (preg_match($pNonGreedyMatch, $sShortString, $aMatch)) {
    echo "Non-Greedy Match:
";
    print_r($aMatch);
  }

  if (preg_match($pGreedyMatch, $sLongString, $aMatch)) {
    echo "Greedy Match:
";
    echo "Length: ".strlen($aMatch[1])."
";
  }

  if (preg_match($pNonGreedyMatch, $sLongString, $aMatch)) {
    echo "Non-Greedy Match:
";
    echo strlen($aMatch[1]);
  } else {
    echo "Non-Greedy Doesn't Match!
";
  }

  $iLastError = preg_last_error();
  if ($iLastError == PREG_BACKTRACK_LIMIT_ERROR) {
    echo "It's because the backtrack limit was exceeded!
";
  }

?>

I get the output:

Greedy Match:
Array
(
    [0] => [Opening]Really Really Long Text...[Closing]More Text[Closing]
    [1] => Really Really Long Text...[Closing]More Text
)
Non-Greedy Match:
Array
(
    [0] => [Opening]Really Really Long Text...[Closing]
    [1] => Really Really Long Text...
)
Greedy Match:
Length: 4000018
Non-Greedy Doesn't Match!
It's because the backtrack limit was exceeded!

I've got it working by using the greedy regular expression and using additional code to strip off the text from [Closing] onward. I would like to better understand what's happening behind the scenes, why it needs to do so much backtracking, and if there's a way that the regular expression can be modified so it performs the task.

I really appreciate any insight!

  • 写回答

1条回答 默认 最新

  • duanqiu2064 2018-05-26 23:31
    关注

    A non-greedy quantifier has a cost because each time it reads a character, it has to check against the end of the pattern.

    In the above pattern, each time the . in (.+?) matches, it does a check to see if the following characters match [Closing]. Each time this happens, and it doesn't match, it has to backtrack and continue the search. This is why the backtrack limit it used up.

    The pattern can be rewritten like this:

    '\[Opening\]([^\[]*(?:\[(?!Closing)[^\[]*)*)(*SKIP)\[Closing\]'si
    

    Let's examine this pattern piece by piece to understand it.

    1) We open with \[Opening\]. This pattern matches the opening tag.

    2) As our pattern isn't repeating within itself, the ()(*SKIP) directive is used as a further optimization. It means that if we don't match the pattern then we will restart our search from the end of where we were looking. The default behaviour would start to search again at the next character.

    To better understand this, imagine that our string is sometimes we get [Close to matching. When we get to the [, we scan [Clos before we conclude that this actually isn't the pattern we want. Normally, we backtrack and then start again looking at Close. However, (*SKIP) allows us to continue searching at e to matc.

    3) Inside our brackets we start with the pattern [^\[]*, which allows us to match as many characters as we can which are not [. ^ indicates not, \[ is for the [, and [] surrounds it as a character set. * allows it to repeat as many times as possible.

    4) Now, we have (?:)*. () allows us to specify a string, and ?: indicates that is not going to be saved, and * allows it to repeat as many times as we like (including no times at all).

    5) The first character in that string is \[ or just the [ we expect as part of our closing tag.

    6) Next, we have (?!Closing\]). (?!) is a negative lookahead. A lookahead means that the parser will look at the next characters and either match or fail to match without consuming the characters. This allows us to match something as long as it's not Closing] without actually consuming it.

    7) We have another [^\[]* which allows us to continue to eat characters after our failure to lookahead. This allows us to continue to consume the string after we get something like [Clos.

    8) Finally, our regular expression ends with \[Closing\].

    评论

报告相同问题?

悬赏问题

  • ¥15 如何在scanpy上做差异基因和通路富集?
  • ¥20 关于#硬件工程#的问题,请各位专家解答!
  • ¥15 关于#matlab#的问题:期望的系统闭环传递函数为G(s)=wn^2/s^2+2¢wn+wn^2阻尼系数¢=0.707,使系统具有较小的超调量
  • ¥15 FLUENT如何实现在堆积颗粒的上表面加载高斯热源
  • ¥30 截图中的mathematics程序转换成matlab
  • ¥15 动力学代码报错,维度不匹配
  • ¥15 Power query添加列问题
  • ¥50 Kubernetes&Fission&Eleasticsearch
  • ¥15 報錯:Person is not mapped,如何解決?
  • ¥15 c++头文件不能识别CDialog