doumeng1143 2013-04-10 19:05

Parsing markup into an abstract syntax tree using regular expressions

This question is supplementary to: Recursive processing of markup using Regular Expression and DOMDocument

The code supplied by the selected answer has been a great help in understanding how to build a basic syntax tree. However, I am now having trouble tightening the regular expressions so that they match only my syntax, instead of matching { followed by any character (but not {{). Ideally it would match only my syntax, which is:

{<anchor>}
{!image!}
{*strong*}
{/emphasis/}
{|code|}
{-strikethrough-}
{>small<}

Two tags, a and small, also require a closing character that differs from the opening one. I have tried modifying $re_closetag from the original code sample to reflect this, but it still matches too much as text.

For example:

http://www.google.com/>} bang 
smäll<} boom 

My test string is:

tëstïng {{ 汉字/漢字 }} testing {<http://www.google.com/>} bang {>smäll<} boom {* strông{/ ëmphäsïs {- strïkë {| côdë |} -} /} *} {*wôw*} 1, 2, 3

1 answer

  • dongwen9975 2013-04-10 20:01

    You can either control this in the RE itself or after a match.

    In the RE, to control which tags may be opened, modify this part of $re_next:

    (?:\{(?P<opentag>[^{\s]))  # match an open tag
          #which is "{" followed by anything other than whitespace or another "{"
    

    Currently it looks for any character which is not { or whitespace. Simply change to this:

    (?:\{(?P<opentag>[<!*/|>-]))
    

    Now it looks for only your specific open tags.
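To see the effect of the change in isolation, here is a minimal sketch that runs both character classes over the test string from the question. The trailing " {@ not a tag @}" is an addition made here for illustration (it is not in the original test string), and the ~ delimiters and variable names are assumptions; only the two character classes come from the answer.

```php
<?php
// Loose opener from the original code: "{" followed by anything
// other than whitespace or another "{".
$loose = '~\{(?P<opentag>[^{\s])~u';
// Tightened opener: "{" followed by one of the seven tag characters.
$tight = '~\{(?P<opentag>[<!*/|>-])~u';

// Test string from the question, plus a made-up "{@ ... @}" fragment
// (an assumption added here) that should NOT count as a tag.
$s = 'tëstïng {{ 汉字/漢字 }} testing {<http://www.google.com/>} bang '
   . '{>smäll<} boom {* strông{/ ëmphäsïs {- strïkë {| côdë |} -} /} *} '
   . '{*wôw*} 1, 2, 3 {@ not a tag @}';

preg_match_all($loose, $s, $m_loose);
preg_match_all($tight, $s, $m_tight);

// Opening characters found by each pattern, in order of appearance.
$loose_tags = $m_loose['opentag'];
$tight_tags = $m_tight['opentag'];

echo 'loose: ', implode(' ', $loose_tags), "\n"; // loose: < > * / - | * @
echo 'tight: ', implode(' ', $tight_tags), "\n"; // tight: < > * / - | *
```

Neither pattern treats {{ as an opener; the only difference is that the tightened class ignores {@.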

    The close tag portion only matches a single character at a time, depending on which tag is open in the current context. (This is what the $opentag argument is for.) So to close a tag with a character that differs from the one that opened it, simply change the $opentag value passed to the recursive call. E.g.:

            if (isset($m['opentag']) && $m['opentag'][1] !== -1) {
                list($newopen, $_) = $m['opentag'];
    
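                // change the close character to look for in the new context:
                // "{<" is closed by ">}" and "{>" is closed by "<}".
                // Note: $newopen now holds the close character, so for these
                // two tags the node appended below is labelled with it.
                if ($newopen==='>') $newopen = '<';
                else if ($newopen==='<') $newopen = '>';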
    
                list($subast, $offset) = str_to_ast($s, $offset, array(), $newopen);
                $ast[] = array($newopen, $subast);
            } else if (isset($m['text']) && $m['text'][1] !== -1) {
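
The same idea can be tried out on its own with a much smaller parser. This is only a sketch under stated assumptions: parse_markup() and close_char_for() are hypothetical names, not the str_to_ast() from the original code, and unlike the fragment above it keeps the opening character as the node label and passes the close character separately. Unclosed tags are simply tolerated at end of input.

```php
<?php
// Hypothetical helper: which character closes a given opening character.
function close_char_for(string $open): string {
    $pairs = ['<' => '>', '>' => '<'];
    return $pairs[$open] ?? $open; // all other tags close with the same character
}

// Simplified stand-in for str_to_ast(); returns array($ast, $new_offset).
function parse_markup(string $s, int $offset = 0, ?string $close = null): array {
    $ast = [];
    $text = '';
    $len = strlen($s);
    while ($offset < $len) {
        // End of the current context: expected close character, then "}".
        if ($close !== null && substr($s, $offset, 2) === $close . '}') {
            $offset += 2;
            break;
        }
        // Open tag: "{" followed by one of the allowed characters.
        if (preg_match('~\G\{([<!*/|>-])~', $s, $m, 0, $offset)) {
            if ($text !== '') { $ast[] = ['#text', $text]; $text = ''; }
            $open = $m[1];
            list($sub, $offset) = parse_markup($s, $offset + 2, close_char_for($open));
            $ast[] = [$open, $sub]; // node is labelled with the opening character
            continue;
        }
        // Anything else is text. Copying byte by byte is safe for UTF-8
        // here because every delimiter is a single ASCII byte.
        $text .= $s[$offset++];
    }
    if ($text !== '') { $ast[] = ['#text', $text]; }
    return [$ast, $offset];
}

list($ast, $_) = parse_markup('a {<http://www.google.com/>} b {>smäll<} c');
echo json_encode($ast, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES), "\n";
```

With this arrangement "http://www.google.com/" and "smäll" end up as text inside their own nodes, while " b " and " c" stay in the outer context.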
    

    Alternatively, you can keep the RE as-is and decide what to do with the match after the fact. For example, if you match a @ character but {@ is not an allowed open tag, you can either raise a parse error or simply treat it as a text node (attaching array('#text', '{@') to the ast), or anything in between.
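
A rough sketch of that second approach, assuming a hypothetical classify_opener() helper and a $strict switch (neither is part of the original code): the loose pattern is kept, and each matched character is checked against the allowed set afterwards.

```php
<?php
// Hypothetical helper: decide what to do with a loosely matched opener.
// Returns array('open', $char) for an allowed tag; otherwise either
// throws (strict mode) or returns a text node holding the literal "{X".
function classify_opener(string $char, bool $strict = false): array {
    $allowed = ['<', '!', '*', '/', '|', '-', '>'];
    if (in_array($char, $allowed, true)) {
        return ['open', $char];
    }
    if ($strict) {
        throw new UnexpectedValueException('Unknown open tag: {' . $char);
    }
    return ['#text', '{' . $char];
}

// Loose opener pattern, unchanged from the original code.
$re_open = '~\{(?P<opentag>[^{\s])~u';
preg_match_all($re_open, 'x {*bold*} y {@ z', $m);

// "*" is an allowed tag; "@" falls back to a text node.
$nodes = array_map('classify_opener', $m['opentag']);
echo json_encode($nodes), "\n";
```

Whether an unknown opener should raise an error or degrade to text is a design choice; the lenient default here keeps stray braces in user input from breaking the whole parse.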

    This answer was selected by the asker as the best answer.
