dongshanni1611 2015-06-27 12:49

If-else in a recursive regular expression not working as expected

I am using a regex to parse some BBCode, so the regex has to work recursively to also match tags inside others. Most of the BBCode has an argument, and sometimes it's quoted, though not always.

A simplified equivalent of the regex I'm using (with html style tags to reduce the escaping needed) is this:

'~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
  ([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x'

However, if I have a test string that looks like this:

<"a">Content<a>Also Content</a></a>

it only matches <a>Also Content</a>. When the regex tries to match from the first tag, the first capturing group, \1, is set to ", and this value is not overwritten when the regex recurses to match the inner tag. Because the inner tag isn't quoted, the conditional still demands a closing quote, so the recursion fails and the match starting at the outer tag fails with it.

If instead I consistently either use or don't use quotes, it works fine, but I can't be sure that that will be the case with the content that I have to parse. Is there any way to work around this?
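
The behaviour described above can be reproduced with a short script. This is only a sketch: the pattern is the simplified one from the question, while the helper `firstMatch` and the variable names are hypothetical and exist purely for illustration.

```php
<?php
// Simplified pattern from the question, unchanged.
$pattern = '~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
  ([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x';

// Hypothetical helper: returns the full match, or null when nothing matches.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// Outer tag quoted, inner tag unquoted: the case that goes wrong.
$mixed = firstMatch($pattern, '<"a">Content<a>Also Content</a></a>');

// Consistently unquoted and consistently quoted tags.
$unquoted = firstMatch($pattern, '<a>Content<a>Also Content</a></a>');
$quoted   = firstMatch($pattern, '<"a">Content<"a">Also Content</a></a>');

var_dump($mixed, $unquoted, $quoted);
```

With mixed quoting only the inner `<a>Also Content</a>` should come back, while the two consistent inputs should match in full.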


The full regex that I'm using, to match [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler], is

"~\[spoiler\s*+ #Match the opening tag
            (?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
          (?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
          ((?:[^\[\n]++ #Capture all characters until the closing tag
            |\n(?!\[spoiler]) Capture new line separately so backtracking doesn't run away due to above
            |\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
            |(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
                     # without messing up nesting
          \n? #Get rid of an extra new line before the closing tag if necessary
          \[/spoiler] #match the closing tag
         ~xi"

There are a couple of other bugs with it as well though.

2 answers

  • dongyuanliao6204 2015-06-27 12:58

    The simplest solution is to use alternatives instead:

    <(?:a|"a")>
      ([^<]++ | (?R))*
    </a>
    

    But if you really don't want to repeat that a part, you can do the following:

    <("?)a\1>
      ([^<]++ | (?R))*
    </a>
    

    Demo

    I've just moved the optional quantifier ? inside the group. This time the capturing group always participates in the match, although what it captures may be empty, so the conditional isn't necessary anymore.

    Side note: I've applied a possessive quantifier to [^<] to avoid catastrophic backtracking.
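
To check both variants against the test string from the question, something like the following could be used. It is a minimal sketch assuming PHP's `preg_match`; the `~` delimiters, the `x` flag and the variable names are my own additions and not part of the patterns above.

```php
<?php
// Variant 1: spell out both forms of the opening tag as alternatives.
$alternation = '~<(?:a|"a")>
  ([^<]++ | (?R))*
</a>~x';

// Variant 2: group 1 always participates (possibly capturing an empty string),
// so \1 matches either a closing quote or nothing at all.
$optionalQuote = '~<("?)a\1>
  ([^<]++ | (?R))*
</a>~x';

// Test string from the question: quoted outer tag, unquoted inner tag.
$subject = '<"a">Content<a>Also Content</a></a>';

preg_match($alternation, $subject, $byAlternation);
preg_match($optionalQuote, $subject, $byOptionalQuote);

var_dump($byAlternation[0], $byOptionalQuote[0]);
```

Both patterns should now match the whole string. In the second one, group 1 is captured afresh inside the recursion and reverts to the outer value once the recursion returns, which is why the unquoted inner tag no longer causes a failure.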


    In your case I believe it's better to match a generic tag than a specific one. Match all tags, and then decide in your code what to do with the match.

    Here's a full regex:

    \[
      (?<tag>\w+) \s*
      (?:=\s*
        (?:
          (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
          | (?<arg>[^\]]+)
        )
      )?
    \]
    
    (?<content>
      (?:[^[]++ | (?R) )*+
    )
    
    \[/\k<tag>\]
    

    Demo

    Note that I added the J option (PCRE_DUPNAMES) to be able to use (?<arg>...) twice.
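
As a rough illustration of matching a generic tag and deciding afterwards in code, the regex above could be used from PHP along these lines. This is a sketch under assumptions: the sample input and variable names are invented, the inline `(?J)` stands in for the J option, and `\'` is only PHP string escaping for a literal `'` in the pattern.

```php
<?php
// Generic tag regex from above; (?J) allows (?<arg>...) to appear twice,
// and the x flag lets the pattern keep its layout whitespace.
$pattern = '~(?J)
\[
  (?<tag>\w+) \s*
  (?:=\s*
    (?:
      (?<quote>["\']) (?<arg>.{0,100}?) \k<quote>
      | (?<arg>[^\]]+)
    )
  )?
\]

(?<content>
  (?:[^[]++ | (?R) )*+
)

\[/\k<tag>\]
~x';

// Invented sample input: a spoiler with a quoted option and a nested tag.
$subject = '[spoiler="Ending"]He wins [b]again[/b].[/spoiler]';

if (preg_match($pattern, $subject, $m) === 1) {
    // Decide in code what to do with the matched tag.
    if (strtolower($m['tag']) === 'spoiler') {
        var_dump($m['content']);
    }
}
```

Only `tag` and `content` are read here. How a duplicated name such as `arg` is exposed in `$m` may depend on the PHP version, so it is worth checking before relying on it.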

    Selected by the asker as the best answer.