dongshanni1611 2015-06-27 12:49

If-else not working as expected in a recursive regular expression

I am using a regex to parse some BBCode, so the regex has to work recursively to also match tags inside others. Most of the BBCode has an argument, and sometimes it's quoted, though not always.

A simplified equivalent of the regex I'm using (with html style tags to reduce the escaping needed) is this:

'~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
  ([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x'

However, if I have a test string that looks like this:

<"a">Content<a>Also Content</a></a>

it only matches <a>Also Content</a>. When the regex tries to match from the first tag, the first capturing group, \1, is set to ". That value is not reset when the pattern recurses to match the inner tag, so the conditional demands a closing quote there as well. The inner tag isn't quoted, so the recursion fails, and the outer match fails with it.

If instead I consistently either use or don't use quotes, it works fine, but I can't be sure that that will be the case with the content that I have to parse. Is there any way to work around this?


The full regex that I'm using, to match [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler], is

"~\[spoiler\s*+ #Match the opening tag
            (?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
          (?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
          ((?:[^\[\n]++ #Capture all characters until the closing tag
            |\n(?!\[spoiler]) Capture new line separately so backtracking doesn't run away due to above
            |\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
            |(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
                     # without messing up nesting
          \n? #Get rid of an extra new line before the closing tag if necessary
          \[/spoiler] #match the closing tag
         ~xi"

There are a couple of other bugs with it as well though.


  • dongyuanliao6204 2015-06-27 12:58

    The simplest solution is to use alternatives instead:

    <(?:a|"a")>
      ([^<]++ | (?R))*
    </a>
    

    But if you really don't want to repeat that a part, you can do the following:

    <("?)a\1>
      ([^<]++ | (?R))*
    </a>
    


    I've just moved the optional quantifier ? inside the group. This time the capturing group always participates in the match, although what it captures may be empty, so the conditional isn't necessary anymore.
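As a rough illustration of why the always-participating group works, the sketch below checks only the opening tag, because Python's standard `re` module has no `(?R)` recursion. The names `OPEN_TAG` and `is_open_tag` are made up for this example and are not part of the original answer.

```python
import re

# Group 1 always participates: it captures either '"' or the empty string.
# The backreference \1 therefore demands a closing quote only when an
# opening quote was actually consumed.
OPEN_TAG = re.compile(r'<("?)a\1>')


def is_open_tag(text: str) -> bool:
    """Return True if text is exactly <a> or <"a">."""
    return OPEN_TAG.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ['<a>', '<"a">', '<"a>', '<a">']:
        print(candidate, is_open_tag(candidate))
```

Tags with balanced quoting (or none) are accepted, while a quote on only one side is rejected, which is the behaviour the conditional was meant to provide.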

    Side note: I've applied a possessive quantifier to [^<] to avoid catastrophic backtracking.


    In your case I believe it's better to match a generic tag than a specific one. Match all tags, and then decide in your code what to do with the match.

    Here's a full regex:

    \[
      (?<tag>\w+) \s*
      (?:=\s*
        (?:
          (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
          | (?<arg>[^\]]+)
        )
      )?
    \]
    
    (?<content>
      (?:[^[]++ | (?R) )*+
    )
    
    \[/\k<tag>\]
    


    Note that I added the J option (PCRE_DUPNAMES) to be able to use (?<arg>...) twice.
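The "match every tag, then decide in code" idea can also be sketched without recursion. The example below is an assumption-laden variant, not the answer's PCRE pattern: Python's standard `re` module supports neither `(?R)` nor duplicate group names, so it matches single tags and pairs them with an explicit stack. The group names `arg_quoted` and `arg_bare` and the function `parse_tags` are hypothetical.

```python
import re

# One match per opening or closing tag. The quoted and bare arguments get
# separate group names because Python's re rejects duplicate names.
TAG = re.compile(r"""
    \[
      (?P<close>/)?            # present for closing tags such as [/spoiler]
      (?P<tag>\w+) \s*
      (?:=\s*
        (?:
          (?P<quote>["']) (?P<arg_quoted>.{0,100}?) (?P=quote)
          | (?P<arg_bare>[^\]]+)
        )
      )?
    \]
""", re.VERBOSE)


def parse_tags(text):
    """Return (tag, arg, content) for every properly closed tag, innermost first."""
    stack = []   # open tags still waiting for their closing tag
    found = []
    for m in TAG.finditer(text):
        name = m.group("tag").lower()
        if m.group("close") is None:
            arg = m.group("arg_quoted")
            if arg is None:
                arg = m.group("arg_bare")
            stack.append((name, arg, m.end()))
        elif stack and stack[-1][0] == name:
            _, arg, start = stack.pop()
            found.append((name, arg, text[start:m.start()]))
        # Any other closing tag is unbalanced and is ignored in this sketch.
    return found


if __name__ == "__main__":
    sample = '[spoiler="x"]outer [spoiler=y]inner[/spoiler] tail[/spoiler]'
    for tag, arg, content in parse_tags(sample):
        print(tag, arg, repr(content))
```

Because each opening tag is matched on its own, quoted and unquoted arguments can be mixed freely at different nesting levels. The calling code then decides which tag names, such as spoiler, it actually wants to handle.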

    This answer was selected by the asker as the best answer.