dousui6488 2012-01-25 06:10 Acceptance rate: 100%

Can't understand the difference between two preg_match patterns

In the original code (a Drupal core module), a previous developer commented out this line:

if (preg_match('/[^\x{80}-\x{F7} a-z0-9@_.\'-]/i', $name)) {

and instead, added:

if (preg_match('/[^\x{80}-\x{F7} a-z0-9@_.\'-]/iu', $name)) {

Can you help me understand the difference between these two? What does the u modifier do? In the PHP docs I found:

u (PCRE8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

So I guess the previous developer had problems with how special characters were interpreted, or something similar. I'm a bit puzzled, please advise.


2 answers

  • doulin6761 2012-01-25 06:37

    The modifier is needed to process UTF-8 encoded input properly. A pattern like \xC1 should match the Unicode character U+00C1 (Á). When you encode Á in UTF-8 you get the two bytes \xC3\x81, so without the modifier \xC1 doesn't match. The "u" modifier makes the regex engine treat both pattern and subject as UTF-8, so it does match.

    Basically, when you work with utf-8 encoded text this is what will happen:

    <?php
    var_dump(preg_match('/\xC1/u', 'Á'));
    // => int(1), matches
    
    var_dump(preg_match('/\xC1/', 'Á'));
    // => int(0), doesn't match
    ?>
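
The same comparison can be made independent of how the source file is saved by writing Á as its two UTF-8 bytes. This is only a small sketch; the variable names ($subject, $byteMode, $utf8Mode, $rawBytes) are made up for illustration:

```php
<?php
// "Á" (U+00C1) spelled out as its two UTF-8 bytes, so the file encoding does not matter.
$subject = "\xC3\x81";

echo bin2hex($subject), "\n"; // hex dump of the bytes actually stored in the string
echo strlen($subject), "\n";  // strlen() counts bytes, not characters

// Without /u the subject is scanned byte by byte; neither byte is 0xC1.
$byteMode = preg_match('/\xC1/', $subject);
// With /u the subject is decoded as UTF-8 and \xC1 means code point U+00C1.
$utf8Mode = preg_match('/\xC1/u', $subject);
// In byte mode the encoded bytes have to be spelled out to get a match.
$rawBytes = preg_match('/\xC3\x81/', $subject);

var_dump($byteMode, $utf8Mode, $rawBytes);
```

The hex dump shows c381 and a length of 2, which is why the byte-mode pattern \xC1 has nothing to match against.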
    

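    In your case the first regular expression (without "u") is applied byte by byte. Every byte of a non-ASCII character encoded in UTF-8 lies between \x80 and \xF4, which is inside the allowed range \x80-\xF7, so the negated class never matches on non-ASCII UTF-8 text, whatever the script. The second expression (with "u") compares whole code points, so it matches any Unicode character outside U+0080 - U+00F7 (apart from the listed ASCII characters), which covers Cyrillic, Greek, Arabic, Hebrew, ...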
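
To see the difference on the two patterns from the question, here is a rough sketch that runs both against a few sample names. The sample names and the variable names ($byteMode, $utf8Mode, $names, $results) are assumptions for illustration, not taken from Drupal:

```php
<?php
// The two patterns from the question, differing only in the "u" modifier.
$byteMode = '/[^\x{80}-\x{F7} a-z0-9@_.\'-]/i';
$utf8Mode = '/[^\x{80}-\x{F7} a-z0-9@_.\'-]/iu';

// Hypothetical sample names, written as UTF-8 bytes so the file encoding does not matter.
$names = [
    'ascii'    => 'john_doe',
    'latin1'   => "Jos\xC3\xA9",                       // "José": é is U+00E9, inside U+0080-U+00F7
    'cyrillic' => "\xD0\x96\xD0\xB5\xD0\xBD\xD1\x8F", // "Женя": all code points above U+00F7
    'symbol'   => 'a#b',                               // "#" is not in the allowed ASCII set
];

$results = [];
foreach ($names as $label => $name) {
    // 1 means the negated class found a character that is not allowed.
    $results[$label] = [preg_match($byteMode, $name), preg_match($utf8Mode, $name)];
    printf("%-8s without u: %d  with u: %d\n", $label, $results[$label][0], $results[$label][1]);
}
```

Only the Cyrillic name gives different results: 0 without "u" and 1 with it. The ASCII and Latin-1 names give 0 in both cases, and the name containing "#" gives 1 in both.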

    This answer was selected by the asker as the best answer.