RegEx performance: \w vs [a-zA-Z0-9_]

I'd like to know the list of characters that \w matches. Is it just [a-zA-Z0-9_], or does it cover more characters?

I'm asking because, based on this, \d differs from [0-9] and is less efficient.

\w vs [a-zA-Z0-9_]: which one is likely to be faster at large scale?

doutun9179 2019-04-16 08:42
[ This answer is Perl-specific. The information within may not apply to PCRE or the engine used by the other languages tagged. ]

/\w/aa (the actual equivalent of /[a-zA-Z0-9_]/) is usually faster, but not always. That said, the difference is so small (less than 1 nanosecond per check on ASCII input) that it shouldn't be a concern. To put it into context, it takes far, far longer to call a sub or start the regex engine.

What follows covers this in detail.

First of all, \w isn't the same as [a-zA-Z0-9_] by default. \w matches every alphabetic, numeric, mark and connector punctuation Unicode Code Point. There are 119,821 of these! [1] Comparing the speed of non-equivalent code makes no sense.

However, using \w with /aa ensures that \w only matches [a-zA-Z0-9_]. So that's what we're going to be using for our benchmarks. (Actually, we'll use both.)
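The difference is easy to see on a few characters. The following is a minimal sketch, assuming Perl 5.14 or later (where the /u and /aa modifiers are available); the sample characters are chosen for illustration only and are not part of the benchmark.

```perl
use strict;
use warnings;

# Illustrative characters, chosen for this sketch (not part of the benchmark).
my @samples = (
    [ 'J',      'J'          ],   # ASCII letter
    [ '!',      '!'          ],   # ASCII punctuation
    [ 'U+0100', "\N{U+0100}" ],   # LATIN CAPITAL LETTER A WITH MACRON
    [ 'U+0663', "\N{U+0663}" ],   # ARABIC-INDIC DIGIT THREE
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    printf "%-6s  /\\w/u: %-3s  /\\w/aa: %-3s  /[a-zA-Z0-9_]/: %s\n",
        $name,
        ( $ch =~ /\w/u          ? 'yes' : 'no' ),
        ( $ch =~ /\w/aa         ? 'yes' : 'no' ),
        ( $ch =~ /[a-zA-Z0-9_]/ ? 'yes' : 'no' );
}

# Count matching code points over the whole Unicode range.
my ($unicode_count, $ascii_count) = (0, 0);
{
    no warnings 'utf8';   # silence any warnings about surrogates and non-characters
    for my $cp (0 .. 0x10FFFF) {
        my $ch = chr($cp);
        $unicode_count++ if $ch =~ /\w/u;
        $ascii_count++   if $ch =~ /\w/aa;
    }
}
print "/\\w/u  matches $unicode_count code points\n";
print "/\\w/aa matches $ascii_count code points\n";
```

/\w/aa should report 63 code points (26 + 26 + 10 + 1). The /\w/u total depends on the Unicode version bundled with the Perl that runs the script, so it will not necessarily equal the figure quoted above.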

(Note that each test performs 10 million checks, so a rate of 10.0/s actually means 10.0 million checks per second.)

ASCII-only positive match

                   Rate [a-zA-Z0-9_]      (?u:\w)     (?aa:\w)
    [a-zA-Z0-9_] 39.1/s           --         -26%         -36%
    (?u:\w)      52.9/s          35%           --         -13%
    (?aa:\w)     60.9/s          56%          15%           --

When finding a match in ASCII characters, ASCII-only \w and Unicode \w both beat the explicit class.

/\w/aa is ( 1/39.1 - 1/60.9 ) / 10,000,000 = 0.000,000,000,916 s faster on my machine
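The per-check figures in this answer all follow the same arithmetic: one benchmark iteration performs 10 million checks, so the time saved per check is the difference of the two iteration times divided by 10,000,000. A small sketch of that calculation follows; the helper name `seconds_saved_per_check` is made up for this example.

```perl
use strict;
use warnings;

# Checks per benchmark iteration: 1,000 loops over a 10,000-character string.
my $CHECKS_PER_ITERATION = 1_000 * 10_000;

# Hypothetical helper: seconds saved per check by the faster variant,
# given two cmpthese rates in iterations per second.
sub seconds_saved_per_check {
    my ($slower_rate, $faster_rate) = @_;
    return ( 1 / $slower_rate - 1 / $faster_rate ) / $CHECKS_PER_ITERATION;
}

# ASCII-only positive match: [a-zA-Z0-9_] at 39.1/s vs (?aa:\w) at 60.9/s.
printf "%.3e s saved per check\n", seconds_saved_per_check( 39.1, 60.9 );
```

Passing the rates from any of the other tables gives the corresponding per-check difference.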

ASCII-only negative match

                   Rate      (?u:\w)     (?aa:\w) [a-zA-Z0-9_]
    (?u:\w)      27.2/s           --          -0%         -12%
    (?aa:\w)     27.2/s           0%           --         -12%
    [a-zA-Z0-9_] 31.1/s          14%          14%           --

When failing to find a match in ASCII characters, the explicit class beats ASCII-only \w.

/[a-zA-Z0-9_]/ is ( 1/27.2 - 1/31.1 ) / 10,000,000 = 0.000,000,000,461 s faster on my machine

Non-ASCII positive match

                   Rate      (?u:\w) [a-zA-Z0-9_]     (?aa:\w)
    (?u:\w)      2.97/s           --        -100%        -100%
    [a-zA-Z0-9_] 3349/s      112641%           --          -9%
    (?aa:\w)     3664/s      123268%           9%           --

Whoa. This test appears to be running into some optimization. That said, running the test multiple times yields extremely consistent results. (Same goes for the other tests.)

When finding a match in non-ASCII characters, ASCII-only \w beats the explicit class.

/\w/aa is ( 1/3349 - 1/3664 ) / 10,000,000 = 0.000,000,000,002,57 s faster on my machine
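One possible reading of the very high rates in this test, offered as an assumption rather than something the benchmark output confirms: for the two ASCII-only patterns this is not really a positive match. U+0100 is outside [a-zA-Z0-9_], so the anchored pattern can fail at the first character instead of checking all 10,000. The sketch below only verifies which patterns match the test string; it does not measure anything.

```perl
use strict;
use warnings;

# Same input as the non-ASCII positive test.
my $s = "\N{U+0100}" x 10_000;

# Record whether each anchored pattern matches the whole string.
my %matched = (
    '(?u:\w)'      => ( $s =~ /^\w*\z/u          ? 1 : 0 ),
    '(?aa:\w)'     => ( $s =~ /^\w*\z/aa         ? 1 : 0 ),
    '[a-zA-Z0-9_]' => ( $s =~ /^[a-zA-Z0-9_]*\z/ ? 1 : 0 ),
);

printf "%-13s %s\n", $_, ( $matched{$_} ? 'matches' : 'fails' )
    for sort keys %matched;
```

Only the Unicode-aware pattern matches, which would be consistent with it being the only one that has to walk the entire string.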

Non-ASCII negative match

                   Rate      (?u:\w) [a-zA-Z0-9_]     (?aa:\w)
    (?u:\w)      2.66/s           --          -9%         -71%
    [a-zA-Z0-9_] 2.91/s          10%           --         -68%
    (?aa:\w)     9.09/s         242%         212%           --

When failing to find a match in non-ASCII characters, ASCII-only \w beats the explicit class.

/\w/aa is ( 1/2.91 - 1/9.09 ) / 10,000,000 = 0.000,000,023,4 s faster on my machine

Conclusions

I'm surprised there's any difference between /\w/aa and /[a-zA-Z0-9_]/.

In some situations, /\w/aa is faster; in others, /[a-zA-Z0-9_]/.

The difference between /\w/aa and /[a-zA-Z0-9_]/ is very small (less than 1 nanosecond per check on ASCII input).

The difference is so minimal that you shouldn't be concerned about it.

Even the difference between /\w/aa and /\w/u is quite small, despite the latter matching over three orders of magnitude more characters than the former (119,821 versus 63).

    use strict;
    use warnings;
    use feature qw( say );

    use Benchmark qw( cmpthese );

    # Anchored patterns: a successful match has to check every character.
    my %pos_tests = (
       '(?u:\\w)'     => '/^\\w*\\z/u',
       '(?aa:\\w)'    => '/^\\w*\\z/aa',
       '[a-zA-Z0-9_]' => '/^[a-zA-Z0-9_]*\\z/',
    );

    # Unanchored patterns: a failed match has to check every character.
    my %neg_tests = (
       '(?u:\\w)'     => '/\\w/u',
       '(?aa:\\w)'    => '/\\w/aa',
       '[a-zA-Z0-9_]' => '/[a-zA-Z0-9_]/',
    );

    # Wrap each pattern in a code string that runs it 1,000 times against $s.
    $_ = sprintf( 'use strict; use warnings; our $s; for (1..1000) { $s =~ %s }', $_ )
       for values(%pos_tests), values(%neg_tests);

    local our $s;

    say "ASCII-only positive match";
    $s = "J" x 10_000;
    cmpthese(-3, \%pos_tests);

    say "";

    say "ASCII-only negative match";
    $s = "!" x 10_000;
    cmpthese(-3, \%neg_tests);

    say "";

    say "Non-ASCII positive match";
    $s = "\N{U+0100}" x 10_000;
    cmpthese(-3, \%pos_tests);

    say "";

    say "Non-ASCII negative match";
    $s = "\N{U+2660}" x 10_000;
    cmpthese(-3, \%neg_tests);

[1] Unicode version 11.