dongyaxiao5884 2015-07-27 21:13
浏览 136

正则表达式与golang中嵌套组的问题

Consider the following toy example. I want to match in Go a name with a regexp where the name is sequences of letters a separated by single #, so a#a#aaa is valid, but a# or a##a are not. I can code the regexp in the following two ways:

r1 := regexp.MustCompile(`^a+(#a+)*$`)
r2 := regexp.MustCompile(`^(a+#)*a+$`)

Both of these work. Now consider more complex task of matching a sequence of names separated by single slash. As in above, I can code that in two ways:

^N+(/N+)*$
^(N+/)*N+$

where N is a regexp for the name with ^ and $ stripped. As I have two cases for N, so now I can have 4 regexps:

    ^a+(#a+)*(/a+(#a+)*)*$
    ^(a+#)*a+(/a+(#a+)*)*$
    ^((a+#)*a+/)*a+(#a+)*$
    ^((a+#)*a+/)*(a+#)*a+$

The question is why when matching against the string "aa#a#a/a#a/a" the first one fails while the rest 3 cases work as expected? I.e. what causes the first regexp to mismatch? The full code is:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    str := "aa#a#a/a#a/a"
    regs := []string {
        `^a+(#a+)*(/a+(#a+)*)*$`,
        `^(a+#)*a+(/a+(#a+)*)*$`,
        `^((a+#)*a+/)*a+(#a+)*$`,
        `^((a+#)*a+/)*(a+#)*a+$`,
    }    
    for _, r := range(regs) {
        fmt.Println(regexp.MustCompile(r).MatchString(str))
    } 
}

Surprisingly it prints false true true true

  • 写回答

2条回答 默认 最新

  • drfif48428 2015-07-27 21:56
    关注

    You can try to alleviate atomic sub-nesting of quantifiers.
    If you didn't have the anchors, the expression would possibly blow up
    in backtracking if using nested optional quantifiers like this,
    when it can't find a direct solution.

    Go might be balking at that, forcing it to fail outright instead
    of massive backtracking, but not sure.

    Try something like this, see if it works.

     # ^a+(#?a)*(/a+(#?a)*)*$
    
     ^ 
     a+
     (                    # (1 start)
          \#?
          a 
     )*                   # (1 end)
     (                    # (2 start)
          /
          a+
          (                    # (3 start)
               \#?
               a 
          )*                   # (3 end)
     )*                   # (2 end)
     $
    

    edit: (Transposed from comment) If complexity is too high, some engines will not even compile it, some will silently fail at runtime, some will lock up in a backtracking loop.

    This little subtle difference is the problem

    bad: too complex (#?a+)*
    good: no nested, open ended quantifiers (#?a)*

    If you ever have this problem, strip out the nesting, usually solves
    the backtracking issue.


    eit2: If you require a separator and want to insure a separator is only in the middle and surrounded by a's, you could try this

    https://play.golang.org/p/oM6B6H3Kdx

     #  ^a+[#/](a[#/]?)*a+$
    
     ^ 
     a+
     [\#/] 
     (                             # (1 start)
          a
          [\#/]? 
    
     )*                            # (1 end)
     a+
     $ 
    

    or this

    https://play.golang.org/p/WihqSjH_dI

     # ^a+(?:[#/]a+)+$
    
     ^ 
     a+ 
     (?:
          [\#/] 
          a+
     )+
     $
    
    评论

报告相同问题?

悬赏问题

  • ¥15 写uniapp时遇到的问题
  • ¥15 matlab有限元法求解梁带有若干弹簧质量系统的固有频率
  • ¥15 找一个网络防御专家,外包的
  • ¥100 能不能让两张不同的图片md5值一样,(有尝)
  • ¥15 informer代码训练自己的数据集,改参数怎么改
  • ¥15 请看一下,学校实验要求,我需要具体代码
  • ¥50 pc微信3.6.0.18不能登陆 有偿解决问题
  • ¥20 MATLAB绘制两隐函数曲面的交线
  • ¥15 求TYPCE母转母转接头24PIN线路板图
  • ¥100 国外网络搭建,有偿交流