程序go 2011-01-30 03:33 采纳率: 100%
浏览 194
已采纳

"现代"再现的认知力

What class of languages do real modern regexes actually recognise?

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?

Links to relevant papers would be much appreciated.

转载于:https://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes

  • 写回答

1条回答 默认 最新

  • 狐狸.fox 2011-01-30 15:07
    关注

    Pattern Recursion

    With recursive patterns, you have a form of recursive descent matching.

    This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.

    Restrictions on Recursion

    The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.

    BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. eg: Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

    Recursion Demos

    The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:

    \((?:[^()]*+|(?0))*\)
    

    I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

    \( (?: [^()] *+ | (?0) )* \)
    

    Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

    ‘ (?: [^‘’] *+ | (?0) )* ’
    

    Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

    ^((.)(?1)\2|.?)$
    

    which you can test on most systems using something like this:

    $ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words
    

    Note that PCRE’s implementation of recursion requires the more elaborate

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
    

    This is because of restrictions on how PCRE recursion works.

    Proper Parsing

    To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:

    $rfc5322 = qr{
    
       (?(DEFINE)
    
         (?<address>         (?&mailbox) | (?&group))
         (?<mailbox>         (?&name_addr) | (?&addr_spec))
         (?<name_addr>       (?&display_name)? (?&angle_addr))
         (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
         (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
         (?<display_name>    (?&phrase))
         (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)
    
         (?<addr_spec>       (?&local_part) \@ (?&domain))
         (?<local_part>      (?&dot_atom) | (?&quoted_string))
         (?<domain>          (?&dot_atom) | (?&domain_literal))
         (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                       \] (?&CFWS)?)
         (?<dcontent>        (?&dtext) | (?&quoted_pair))
         (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
    
         (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
         (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
         (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
         (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)
    
         (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
         (?<quoted_pair>     \\ (?&text))
    
         (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
         (?<qcontent>        (?&qtext) | (?&quoted_pair))
         (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                              (?&FWS)? (?&DQUOTE) (?&CFWS)?)
    
         (?<word>            (?&atom) | (?&quoted_string))
         (?<phrase>          (?&word)+)
    
         # Folding white space
         (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
         (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
         (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
         (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
         (?<CFWS>            (?: (?&FWS)? (?&comment))*
                             (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
    
         # No whitespace control
         (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
    
         (?<ALPHA>           [A-Za-z])
         (?<DIGIT>           [0-9])
         (?<CRLF>            \x0d \x0a)
         (?<DQUOTE>          ")
         (?<WSP>             [\x20\x09])
       )
    
       (?&address)
    
    }x;
    

    As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper "Dumper";
    
    my $rfc5322 = do {
        use Regexp::Grammars;    # ...the magic is lexically scoped
        qr{
    
        # Keep the big stick handy, just in case...
        # <debug:on>
    
        # Match this...
        <address>
    
        # As defined by these...
        <token: address>         <mailbox> | <group>
        <token: mailbox>         <name_addr> | <addr_spec>
        <token: name_addr>       <display_name>? <angle_addr>
        <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
        <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
        <token: display_name>    <phrase>
        <token: mailbox_list>    <[mailbox]> ** (,)
    
        <token: addr_spec>       <local_part> \@ <domain>
        <token: local_part>      <dot_atom> | <quoted_string>
        <token: domain>          <dot_atom> | <domain_literal>
        <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
    
        <token: dcontent>        <dtext> | <quoted_pair>
        <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]
    
        <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+-/=?^_`{|}~]
        <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
        <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
        <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*
    
        <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
        <token: quoted_pair>     \\ <.text>
    
        <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
        <token: qcontent>        <.qtext> | <.quoted_pair>
        <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                                 <.FWS>? <.DQUOTE> <.CFWS>?
    
        <token: word>            <.atom> | <.quoted_string>
        <token: phrase>          <.word>+
    
        # Folding white space
        <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
        <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
        <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
        <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
        <token: CFWS>            (?: <.FWS>? <.comment>)*
                                 (?: (?:<.FWS>? <.comment>) | <.FWS>)
    
        # No whitespace control
        <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
        <token: ALPHA>           [A-Za-z]
        <token: DIGIT>           [0-9]
        <token: CRLF>            \x0d \x0a
        <token: DQUOTE>          "
        <token: WSP>             [\x20\x09]
        }x;
    };
    
    while (my $input = <>) {
        if ($input =~ $rfc5322) {
            say Dumper \%/;       # ...the parse tree of any successful match
                                  # appears in this punctuation variable
        }
    }
    

    As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 k8s部署jupyterlab,jupyterlab保存不了文件
  • ¥15 ubuntu虚拟机打包apk错误
  • ¥199 rust编程架构设计的方案 有偿
  • ¥15 回答4f系统的像差计算
  • ¥15 java如何提取出pdf里的文字?
  • ¥100 求三轴之间相互配合画圆以及直线的算法
  • ¥100 c语言,请帮蒟蒻写一个题的范例作参考
  • ¥15 名为“Product”的列已属于此 DataTable
  • ¥15 安卓adb backup备份应用数据失败
  • ¥15 eclipse运行项目时遇到的问题