How to get the correct position in a multibyte string using preg_match

I am currently matching HTML using this code:

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)

It matches everything perfectly. However, if the string contains a multibyte character, that character is counted as 2 characters in the position that is returned.

For example, the returned $match array looks something like this:

array
  0 => 
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 => 
    array
      0 => string 'br' (length=2)
      1 => int 133

The real character position of the <br /> match is 128, but the string contains 4 multibyte characters, so preg_match reports 132. I expected the /u modifier to make it account for multibyte characters, but it does not.


3 answers

duanchanguo7603 2012-04-02 14:05


I looked at this suggestion from @Qtax:

UTF-8 characters in preg_match_all (PHP)

For reference, this bug surfaced while using the code from: Truncate text containing HTML, ignoring tags

The gist of the change is this:

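// Convert from UTF-8 to UTF-32 so that every character occupies exactly 4 bytes.
$orig_utf = 'UTF-8';
$new_utf  = 'UTF-32';

mb_regex_encoding( $new_utf );

$html     = mb_convert_encoding( $html, $new_utf, $orig_utf );
$end_char = mb_convert_encoding( $end_char, $new_utf, $orig_utf );

// Set the subject string for the mb_ereg_search_* functions.
mb_ereg_search_init( $html );

$pattern = '</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;';
$pattern = mb_convert_encoding( $pattern, $new_utf, $orig_utf );

// $printed, $limit and $position come from the truncation code referenced above.
// The subject was set by mb_ereg_search_init(); the second parameter of
// mb_ereg_search_pos() is an options string, so $html is not passed here.
while ( $printed < $limit && $tag_match = mb_ereg_search_pos( $pattern ) ) {

  // mb_ereg_search_pos() returns a byte offset and a byte length;
  // dividing by 4 turns them into character counts.
  $tag_position = $tag_match[0]/4;
  $tag_length   = $tag_match[1];
  $tag          = mb_substr( $html, $tag_position, $tag_length/4, $new_utf );
  $tag_name     = preg_replace( '/[\s<>\/]+/', '', $tag );

  // Print text leading up to the tag.
  $str = mb_substr($html, $position, $tag_position - $position, $new_utf );

  .......

}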
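
The division by 4 in the loop above relies on UTF-32 being a fixed-width encoding. The following is a minimal sketch of that arithmetic only, not part of the original code. It assumes PHP 7 or later with the mbstring extension, and the sample string and variable names are made up for illustration.

```php
<?php
// Hypothetical sample: "é" and "ü" take 2 bytes each in UTF-8.
$utf8  = "\u{00E9} \u{00FC}<br />";
$utf32 = mb_convert_encoding($utf8, 'UTF-32', 'UTF-8');

// The "<" character encoded as UTF-32 (4 bytes).
$needle = mb_convert_encoding('<', 'UTF-32', 'UTF-8');

// Scan in 4-byte steps so only character-aligned positions are compared.
$byteOffset = null;
for ($i = 0, $n = strlen($utf32); $i < $n; $i += 4) {
    if (substr($utf32, $i, 4) === $needle) {
        $byteOffset = $i;
        break;
    }
}

// A byte offset in UTF-32 divided by 4 is a character offset.
$charOffset = $byteOffset === null ? null : intdiv($byteOffset, 4);

// The same character offset, computed directly on the UTF-8 string.
$expected = mb_strpos($utf8, '<', 0, 'UTF-8');
```

Here $charOffset and $expected agree, while strpos($utf8, '<') returns a larger value because it counts bytes.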

Also, with regard to the truncate HTML page, some other changes are necessary:

$first_char = mb_substr( $tag, 0, 1, $new_utf );

if ( $first_char == mb_convert_encoding( '&', $new_utf ) ) {
  ...
}

My source file is saved as UTF-8, so comparing the UTF-32 character against the file's literal ampersand would not work; the literal has to be converted to UTF-32 first.
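
A different approach from the UTF-32 conversion above, sketched here as an alternative: keep using preg_match and convert the byte offset it reports into a character offset afterwards. PREG_OFFSET_CAPTURE reports offsets in bytes even with the /u modifier, so counting the characters in the bytes that precede the match gives the character position. The helper name byteToCharOffset and the sample string are hypothetical, and the sketch assumes PHP 7 or later with the mbstring extension and a valid UTF-8 subject.

```php
<?php
/**
 * Convert a byte offset (as reported by PREG_OFFSET_CAPTURE) into a
 * character offset. Hypothetical helper, not part of the original code.
 */
function byteToCharOffset(string $subject, int $byteOffset, string $encoding = 'UTF-8'): int
{
    // Count the characters contained in the bytes before the match.
    return mb_strlen(substr($subject, 0, $byteOffset), $encoding);
}

// Hypothetical sample: four 2-byte characters precede the tag.
$html    = "\u{00E9}\u{00E8}\u{00FC}\u{00F6} text<br />more";
$pattern = '/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u';

$byteOffset = null;
$charOffset = null;
if (preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, 0)) {
    $byteOffset = $match[0][1];                          // offset in bytes
    $charOffset = byteToCharOffset($html, $byteOffset);  // offset in characters
}
```

The two offsets differ by 4 here, one extra byte per 2-byte character, which mirrors the 132 versus 128 in the question. The offset argument of preg_match is also measured in bytes, so a loop would pass the byte offset back to preg_match and use the converted value only for mb_substr and similar functions.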

This answer was selected by the asker as the best answer.


Question: How to get the correct position in a multibyte string using preg_match (php), asked 2012-03-30 21:51
