lrony* 2011-01-30 03:33 采纳率: 0%
浏览 239
已采纳

"现代"再现的认知力

What class of languages do real modern regexes actually recognise?

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?

Links to relevant papers would be much appreciated.

转载于:https://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes

  • 写回答

1条回答 默认 最新

  • 7*4 2011-01-30 15:07
    关注

    Pattern Recursion

    With recursive patterns, you have a form of recursive descent matching.

    This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.

    Restrictions on Recursion

    The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.

    BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. eg: Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

    Recursion Demos

    The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:

    \((?:[^()]*+|(?0))*\)
    

    I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

    \( (?: [^()] *+ | (?0) )* \)
    

    Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

    ‘ (?: [^‘’] *+ | (?0) )* ’
    

    Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

    ^((.)(?1)\2|.?)$
    

    which you can test on most systems using something like this:

    $ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words
    

    Note that PCRE’s implementation of recursion requires the more elaborate

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
    

    This is because of restrictions on how PCRE recursion works.

    Proper Parsing

    To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:

    $rfc5322 = qr{
    
       (?(DEFINE)
    
         (?<address>         (?&mailbox) | (?&group))
         (?<mailbox>         (?&name_addr) | (?&addr_spec))
         (?<name_addr>       (?&display_name)? (?&angle_addr))
         (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
         (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
         (?<display_name>    (?&phrase))
         (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)
    
         (?<addr_spec>       (?&local_part) \@ (?&domain))
         (?<local_part>      (?&dot_atom) | (?&quoted_string))
         (?<domain>          (?&dot_atom) | (?&domain_literal))
         (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                       \] (?&CFWS)?)
         (?<dcontent>        (?&dtext) | (?&quoted_pair))
         (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
    
         (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
         (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
         (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
         (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)
    
         (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
         (?<quoted_pair>     \\ (?&text))
    
         (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
         (?<qcontent>        (?&qtext) | (?&quoted_pair))
         (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                              (?&FWS)? (?&DQUOTE) (?&CFWS)?)
    
         (?<word>            (?&atom) | (?&quoted_string))
         (?<phrase>          (?&word)+)
    
         # Folding white space
         (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
         (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
         (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
         (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
         (?<CFWS>            (?: (?&FWS)? (?&comment))*
                             (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
    
         # No whitespace control
         (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
    
         (?<ALPHA>           [A-Za-z])
         (?<DIGIT>           [0-9])
         (?<CRLF>            \x0d \x0a)
         (?<DQUOTE>          ")
         (?<WSP>             [\x20\x09])
       )
    
       (?&address)
    
    }x;
    

    As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper "Dumper";
    
    my $rfc5322 = do {
        use Regexp::Grammars;    # ...the magic is lexically scoped
        qr{
    
        # Keep the big stick handy, just in case...
        # <debug:on>
    
        # Match this...
        <address>
    
        # As defined by these...
        <token: address>         <mailbox> | <group>
        <token: mailbox>         <name_addr> | <addr_spec>
        <token: name_addr>       <display_name>? <angle_addr>
        <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
        <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
        <token: display_name>    <phrase>
        <token: mailbox_list>    <[mailbox]> ** (,)
    
        <token: addr_spec>       <local_part> \@ <domain>
        <token: local_part>      <dot_atom> | <quoted_string>
        <token: domain>          <dot_atom> | <domain_literal>
        <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
    
        <token: dcontent>        <dtext> | <quoted_pair>
        <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]
    
        <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+-/=?^_`{|}~]
        <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
        <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
        <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*
    
        <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
        <token: quoted_pair>     \\ <.text>
    
        <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
        <token: qcontent>        <.qtext> | <.quoted_pair>
        <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                                 <.FWS>? <.DQUOTE> <.CFWS>?
    
        <token: word>            <.atom> | <.quoted_string>
        <token: phrase>          <.word>+
    
        # Folding white space
        <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
        <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
        <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
        <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
        <token: CFWS>            (?: <.FWS>? <.comment>)*
                                 (?: (?:<.FWS>? <.comment>) | <.FWS>)
    
        # No whitespace control
        <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
        <token: ALPHA>           [A-Za-z]
        <token: DIGIT>           [0-9]
        <token: CRLF>            \x0d \x0a
        <token: DQUOTE>          "
        <token: WSP>             [\x20\x09]
        }x;
    };
    
    while (my $input = <>) {
        if ($input =~ $rfc5322) {
            say Dumper \%/;       # ...the parse tree of any successful match
                                  # appears in this punctuation variable
        }
    }
    

    As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 ETLCloud 处理json多层级问题
  • ¥15 matlab中使用gurobi时报错
  • ¥15 这个主板怎么能扩出一两个sata口
  • ¥15 不是,这到底错哪儿了😭
  • ¥15 2020长安杯与连接网探
  • ¥15 关于#matlab#的问题:在模糊控制器中选出线路信息,在simulink中根据线路信息生成速度时间目标曲线(初速度为20m/s,15秒后减为0的速度时间图像)我想问线路信息是什么
  • ¥15 banner广告展示设置多少时间不怎么会消耗用户价值
  • ¥15 可见光定位matlab仿真
  • ¥15 arduino 四自由度机械臂
  • ¥15 wordpress 产品图片 GIF 没法显示