Memor.の 2011-01-30 03:33

The recognizing power of "modern" regexes

What class of languages do real modern regexes actually recognise?

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, or than LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG, and the opposite?

Links to relevant papers would be much appreciated.

Reposted from: https://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes

~Onlooker 2011-01-30 15:07

Pattern Recursion

With recursive patterns, you have a form of recursive descent matching.

This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at the end of this post.
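To make the capture-group awkwardness concrete, here is a minimal sketch, assuming Perl 5.10 or later. The key=value grammar and the names $match_only, $with_captures, k and v are invented for illustration. The first pattern only calls named subpatterns, so nothing captured inside those calls should survive the match. The second wraps each call in its own capture group by hand.

```perl
use strict;
use warnings;
use 5.010;

# A tiny grammar for "key=value" that only calls named subpatterns.
my $match_only = qr{
    ^ (?&pair) $
    (?(DEFINE)
        (?<pair>  (?&key) = (?&value) )
        (?<key>   [a-z]+ )
        (?<value> [0-9]+ )
    )
}x;

# The same grammar with an extra capture group wrapped around each call.
my $with_captures = qr{
    ^ (?<k> (?&key) ) = (?<v> (?&value) ) $
    (?(DEFINE)
        (?<key>   [a-z]+ )
        (?<value> [0-9]+ )
    )
}x;

my $input = "width=42";

if ($input =~ $match_only) {
    # Captures made inside a subpattern call are discarded when it returns.
    my @kept = keys %+;
    say "matched, named captures kept: ", scalar @kept;
}

if ($input =~ $with_captures) {
    say "key=$+{k} value=$+{v}";
}
```

With two productions the hand wrapping is tolerable. With dozens of productions it quickly becomes the bookkeeping that Regexp::Grammars is meant to take over.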

Restrictions on Recursion

The question was what kinds of grammars recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.
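As an illustration of reordering productions, here is a small sketch, assuming Perl 5.10 or later. The toy grammar and the names $sum and is_sum are made up for this example. The left-recursive production appears only in a comment. The pattern implements the rewritten, right-recursive form.

```perl
use strict;
use warnings;
use 5.010;

# Left-recursive grammar, which a recursive pattern cannot express directly:
#     sum ::= sum '+' num | num
# Equivalent grammar with the left recursion removed:
#     sum ::= num ('+' sum)?
my $sum = qr{
    ^ (?&sum) $
    (?(DEFINE)
        (?<sum> (?&num) (?: \+ (?&sum) )? )
        (?<num> [0-9]+ )
    )
}x;

# Hypothetical helper: true if the whole string is a sum of integers.
sub is_sum {
    my ($text) = @_;
    return $text =~ $sum ? 1 : 0;
}

for my $text ("1+22+333", "7", "1++2", "+1") {
    say "$text: ", (is_sum($text) ? "match" : "no match");
}
```

The rewrite works because every call to sum now consumes a num before it can recurse, so each level of recursion makes progress through the input.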

BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. For example, Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

Recursion Demos

The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balanced parens:

\((?:[^()]*+|(?0))*\)

I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

\( (?: [^()] *+ | (?0) )* \)

Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

‘ (?: [^‘’] *+ | (?0) )* ’
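The balanced-parens pattern above can be tried out with a short script. This is a sketch assuming Perl 5.10 or later, and it uses a slight variant of the pattern: [^()]++ instead of [^()]*+, and a numbered group with (?1) instead of (?0), so that the ^ and $ anchors stay outside the recursion. The helper name is_balanced is made up for the example.

```perl
use strict;
use warnings;
use 5.010;

# Group 1 wraps one parenthesised unit; (?1) recurses into that group only,
# so the surrounding anchors are not part of the recursion.
my $balanced = qr{ ^ ( \( (?: [^()]++ | (?1) )* \) ) $ }x;

# Hypothetical helper: true if the whole string is one balanced group.
sub is_balanced {
    my ($text) = @_;
    return $text =~ $balanced ? 1 : 0;
}

for my $text ("(a(b)c)", "()", "(()", "(a))") {
    say "$text: ", (is_balanced($text) ? "balanced" : "not balanced");
}
```

The first two inputs should be reported as balanced and the last two as not balanced.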

Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

^((.)(?1)\2|.?)$

which you can test on most systems using something like this:

$ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words
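If no word list is at hand, the same pattern can be checked against a few fixed strings. This is a minimal sketch assuming Perl 5.10 or later. The helper name is_palindrome and the sample words are illustrative only.

```perl
use strict;
use warnings;
use 5.010;

# The palindrome pattern from above, compiled once.
my $palindrome = qr/^((.)(?1)\2|.?)$/i;

# Hypothetical helper: true if the whole string reads the same both ways.
sub is_palindrome {
    my ($word) = @_;
    return $word =~ $palindrome ? 1 : 0;
}

for my $word (qw(racecar abba level perl)) {
    say "$word: ", (is_palindrome($word) ? "palindrome" : "not a palindrome");
}
```

This relies on Perl being able to backtrack into a recursive call, which is what lets the engine find the middle of the word.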

Note that PCRE’s implementation of recursion requires the more elaborate

^(?:((.)(?1)\2|)|((.)(?3)\4|.))

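This is because PCRE treats a recursive subpattern call as an atomic group, so once the recursion has matched, the engine cannot backtrack into it to try a different split the way Perl can.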

Proper Parsing

To me, the examples above are mostly toy matches, not all that interesting, really. Where it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

As you see, that’s very BNF-like. The problem is that it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens, because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can recover that information.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
                             \] <CFWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+\-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]
    }x;
};

while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.
