duankun9280
2016-12-06 01:52

Non-greedy regular expression matching behaves differently

I found that a non-greedy regex match only behaves non-greedily when the pattern is anchored to the front, not when it is anchored to the end:

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*c)/'
abcabcabc
# OK, greedy match

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*?c)/'
abc
# YES! non-greedy match

Now look at this, when anchoring to the end:

$ echo abcabcabc | perl -ne 'print $1 if /(a.*c)$/'
abcabcabc
# OK, greedy match

$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# what, non-greedy become greedy?

Why is that? How come it doesn't print abc as before?

(The problem was found in my Go code, but illustrated in Perl for simplicity).

1 answer

  • dpi9530, 5 years ago
    $ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
    abcabcabc
    # what, non-greedy become greedy?
    

    Non-greedy means it'll match the fewest characters possible at the current location such that the entire pattern matches.

    After matching a at position 0, bcabcab is the least .*? can match at position 1 while still satisfying the rest of the pattern.
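To see the two behaviours side by side outside Perl, here is a minimal sketch using Python's `re` module, which is also a backtracking engine and gives the same results for these two patterns. The variable names are only illustrative.

```python
import re

s = "abcabcabc"

# Anchored at the front: the lazy quantifier can stop at the first "c".
front = re.search(r"^(a.*?c)", s).group(1)

# Anchored at the end: the match still starts at the leftmost "a" (position 0),
# so .*? has to keep growing until the following "c" is the last character.
end = re.search(r"(a.*?c)$", s).group(1)

print(front)  # abc
print(end)    # abcabcabc
```

The lazy quantifier only minimises what it consumes from a given starting position; it never moves the starting position further right to make the overall match shorter.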

    "abcabcabc" =~ /a.*?c$/ in detail:

    1. At pos 0, a matches 1 char (a).
      1. At pos 1, .*? matches 0 chars (empty string).
        1. At pos 1, c fails to match. Backtrack!
      2. At pos 1, .*? matches 1 char (b).
        1. At pos 2, c matches 1 char (c).
          1. At pos 3, $ fails to match. Backtrack!
      3. At pos 1, .*? matches 2 chars (bc).
        1. At pos 3, c fails to match. Backtrack!
      4. ...
      5. At pos 1, .*? matches 7 chars (bcabcab).
        1. At pos 8, c matches 1 char (c).
          1. At pos 9, $ matches 0 chars (empty string). Match successful!

    "abcabcabc" =~ /a.*c$/ in detail (for contrast):

    1. At pos 0, a matches 1 char (a).
      1. At pos 1, .* matches 8 chars (bcabcabc).
        1. At pos 9, c fails to match. Backtrack!
      2. At pos 1, .* matches 7 chars (bcabcab).
        1. At pos 8, c matches 1 char (c).
          1. At pos 9, $ matches 0 chars (empty string). Match successful!
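The two traces above differ only in the order in which lengths are tried for the wildcard. The following is a hand-rolled model of that order, not how any real engine is implemented: the function `attempts` is hypothetical, it only models the attempt that starts at position 0, and it assumes the subject starts with "a" and contains no newline.

```python
def attempts(subject, lazy):
    """Return (lengths_tried, winning_length) for the wildcard in a.*c$ / a.*?c$.

    Only the attempt starting at position 0 is modelled, and `subject` is
    assumed to start with "a" and contain no newline.
    """
    longest = len(subject) - 1  # characters available after the leading "a"
    # Lazy tries the shortest length first, greedy tries the longest first.
    order = range(longest + 1) if lazy else range(longest, -1, -1)
    tried = []
    for n in order:
        tried.append(n)
        pos = 1 + n  # position where the literal "c" has to match
        # "c" must match at pos, and "$" must match right after it.
        if subject[pos:pos + 1] == "c" and pos + 1 == len(subject):
            return tried, n
    return tried, None


lazy_tried, lazy_len = attempts("abcabcabc", lazy=True)
greedy_tried, greedy_len = attempts("abcabcabc", lazy=False)
print(lazy_tried, lazy_len)      # [0, 1, 2, 3, 4, 5, 6, 7] 7
print(greedy_tried, greedy_len)  # [8, 7] 7
```

Both orders end on the same length, 7, because it is the only length that lets `c$` succeed from position 0. Lazy simply needs more attempts to reach it.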

    Tip: Avoid patterns with two instances of a non-greediness modifier. Unless you are using them as an optimization, there's a good chance they can match something you don't want them to match. This is relevant here because patterns implicitly start with \G(?s:.*?)\K (unless cancelled out by a leading ^, \A or \G).
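The implicit leading skip can be made visible with a rough Python analogue. Python's `re` has no `\G` or `\K`, so this sketch assumes `re.match` for the anchoring and a capture group in place of `\K`. It is an approximation of the idea, not a translation of Perl's internals.

```python
import re

s = "abcabcabc"

# Unanchored search: the engine tries start positions from left to right.
implicit = re.search(r"(a.*?c)$", s)

# Same thing spelled out: an explicit lazy skip in front of the pattern,
# anchored at the start of the string by re.match.
explicit = re.match(r"(?s:.*?)(a.*?c)$", s)

print(implicit.span(1))  # (0, 9)
print(explicit.span(1))  # (0, 9)
```

Written this way the pattern contains two lazy quantifiers. The first one wins: it skips as little as possible, so the match starts at position 0 and the second `.*?` is forced to stretch to the end.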

    What you want is one of the following:

    /a[^a]*c$/
    /a[^c]*c$/
    /a[^ac]*c$/
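As a quick check, here are the three character-class patterns run through Python's `re` (same semantics as Perl for these patterns). The second input, "abac", is an illustrative string that is not from the question. It is there to show that the three patterns are not interchangeable.

```python
import re

patterns = [r"a[^a]*c$", r"a[^c]*c$", r"a[^ac]*c$"]

# On the string from the question, all three find the last "abc".
spans = [re.search(p, "abcabcabc").span() for p in patterns]
print(spans)  # [(6, 9), (6, 9), (6, 9)]

# On "abac" they differ: only [^c] lets the match run across the second "a".
found = [re.search(p, "abac").group(0) for p in patterns]
print(found)  # ['ac', 'abac', 'ac']
```

Which one to pick depends on which character must not appear between the two ends of the match.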
    

    You could also use one of the following:

    /a(?:(?!a).)*c$/s
    /a(?:(?!c).)*c$/s
    /a(?:(?!a|c).)*c$/s
    

    It would be inefficient and unreadable to use these latter three in this situation, but they will work with boundaries that are longer than one character.
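A sketch of the lookahead form with multi-character boundaries, again in Python. The delimiters BEGIN and END and the sample text are made up for illustration.

```python
import re

text = "BEGIN x END BEGIN y END"

# Lazy quantifier: still starts at the leftmost BEGIN and stretches to the end.
lazy = re.search(r"BEGIN.*?END$", text, re.S).group(0)

# Negative lookahead: each consumed character must not start another BEGIN,
# so the match cannot run across a second opening boundary.
tempered = re.search(r"BEGIN(?:(?!BEGIN).)*END$", text, re.S).group(0)

print(lazy)      # BEGIN x END BEGIN y END
print(tempered)  # BEGIN y END
```

A character class cannot express "not the sequence BEGIN", which is why the lookahead form is needed once a boundary is longer than one character.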
