dongliang2005 2012-03-30 21:51
Views: 89
Accepted

How to get the correct match position in a multibyte string using preg_match

I am currently matching HTML using this code:

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)

It matches everything perfectly; however, if the string contains a multibyte character, that character is counted as 2 when the position is returned.

For example the returned $match array would give something like:

array
  0 => 
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 => 
    array
      0 => string 'br' (length=2)
      1 => int 133

The actual character position of the <br /> match is 128, but 4 multibyte characters precede it, so it reports 132. I really thought adding the /u modifier would make it account for this, but no luck there.
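
A minimal sketch that reproduces the discrepancy, assuming the source file is saved as UTF-8. The sample string is hypothetical and not taken from the original HTML; it only shows that the offset returned with PREG_OFFSET_CAPTURE is counted in bytes, while mb_strpos counts characters.

```php
<?php
// Hypothetical sample: "é" and "ö" each take 2 bytes in UTF-8.
$html = "héllo wörld<br />";

preg_match('/<\/?([a-z]+)[^>]*>/u', $html, $match, PREG_OFFSET_CAPTURE);

// Offset reported by preg_match (counted in bytes).
$byteOffset = $match[0][1];

// Offset of the same tag counted in characters.
$charOffset = mb_strpos($html, '<br />', 0, 'UTF-8');

echo "bytes: $byteOffset, characters: $charOffset\n"; // bytes: 13, characters: 11
```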


3 answers

  • duanchanguo7603 2012-04-02 14:05

    I looked at this suggestion from @Qtax:

    UTF-8 characters in preg_match_all (PHP)

    And for some more reference, this bug surfaced while using this: Truncate text containing HTML, ignoring tags

    The gist of the change is this:

    $orig_utf = 'UTF-8';
    $new_utf  = 'UTF-32';
    
    mb_regex_encoding( $new_utf );
    
    $html     = mb_convert_encoding( $html, $new_utf, $orig_utf );
    $end_char = mb_convert_encoding( $end_char, $new_utf, $orig_utf );
    
    
    mb_ereg_search_init( $html );
    
    $pattern = '</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;';
    $pattern = mb_convert_encoding( $pattern, $new_utf, $orig_utf );
    
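    // The subject was set by mb_ereg_search_init(), so only the pattern is passed here.
    while ( $printed < $limit && $tag_match = mb_ereg_search_pos( $pattern ) ) {
    
      // Position and length come back in bytes; UTF-32 uses 4 bytes per character.
      $tag_position = $tag_match[0]/4;
      $tag_length   = $tag_match[1];
      $tag          = mb_substr( $html, $tag_position, $tag_length/4, $new_utf );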
      $tag_name     = preg_replace( '/[\s<>\/]+/', '', $tag );
    
      // Print text leading up to the tag.
      $str = mb_substr($html, $position, $tag_position - $position, $new_utf );
    
      .......
    
    } 
    

    Also, in reference to the truncate HTML page, there are other necessary changes:

    $first_char = mb_substr( $tag, 0, 1, $new_utf );
    
    if ( $first_char == mb_convert_encoding( '&', $new_utf ) ) {
      ...
    }
    

    My text editor saves files as UTF-8, so comparing the UTF-32 character directly against the ampersand literal in my file wouldn't work; the literal has to be converted first.
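
    As an alternative to converting everything to UTF-32, the byte offset from preg_match can be translated into a character offset after the fact. This is only a sketch of a different technique, not the change described above; byte_to_char_offset is a hypothetical helper name, and the sample string is made up for illustration.

    ```php
    <?php
    /**
     * Convert a byte offset (as returned with PREG_OFFSET_CAPTURE) into a
     * character offset. Hypothetical helper, not part of PHP.
     */
    function byte_to_char_offset(string $subject, int $byteOffset, string $encoding = 'UTF-8'): int
    {
        // Count the characters in the bytes that precede the match.
        return mb_strlen(substr($subject, 0, $byteOffset), $encoding);
    }

    // Hypothetical sample: 8 characters of 3 bytes each in UTF-8, then a tag.
    $html = "日本語のテキスト<br />";

    $byteOffset = null;
    $charOffset = null;

    if (preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE)) {
        $byteOffset = $match[0][1];
        $charOffset = byte_to_char_offset($html, $byteOffset);
        echo "bytes: $byteOffset, characters: $charOffset\n"; // bytes: 24, characters: 8
    }
    ```

    Note that the $position argument passed to preg_match is also a byte offset, so a loop built on this would keep the byte offset for the next search and use the character offset only with the mb_* functions.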

    This answer was selected as the best answer by the asker.