dtsjq28482 2012-02-02 14:36

preg_split in Unicode mode: delim_capture not working?

I'm trying to use a regex to split a chunk of Chinese text into sentences. For my purposes, sentence delimiters are:

  • the fullwidth full stop 。(0x3002)
  • the fullwidth question mark ？(0xFF1F)
  • the fullwidth exclamation mark ！(0xFF01)

Now, let's say my $str is this: $str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";

I use preg_split with these parameters:

$str2 = preg_split("/([\x{3002}\x{FF01}\x{FF1F}])/u",$str,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

$str2 is now an array that looks like this:

array(3) { [0]=> string(6) "你好" [1]=> string(9) "你好吗" [2]=> string(91) " 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！" }

However, the expected output is:

[0] "你好。" 
[1] "你好吗？"
[2] "我是程序员，不太懂这个我问题，希望大家能够帮忙！"
[3] "一起加油吧！"

As you can see, there are two problems: first, the exclamation marks are not treated as split points, and second, my fullwidth full stop and fullwidth question mark vanish. I'd expect delim_capture to keep them. I've been looking at this code for so long that I can't figure out what the problem is anymore. I would very much appreciate suggestions.


  • doucan8521 2012-02-02 15:29

    Your regex should look like this, so that each match captures the sentence text together with its delimiter:

    $str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";
    $arr = preg_split("/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u",
                      $str, 0, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY );
    var_dump($arr);
    

    OUTPUT:

     array(4) {
      [0]=> string(9)  "你好。"
      [1]=> string(13) "你好吗？ "
      [2]=> string(72) "我是程序员，不太懂这个我问题，希望大家能够帮忙！"
      [3]=> string(18) "一起加油吧！"
    }
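
For context on why the call in the question returned three pieces with no delimiters: `preg_split()` takes `$limit` as its third argument and `$flags` as its fourth, so the flags there were read as a limit of 3. A minimal sketch that reproduces this with the question's own pattern (the variable names `$asLimit` and `$asFlags` are only illustrative):

```php
<?php
// Fullwidth delimiters: 。 (U+3002), ！ (U+FF01), ？ (U+FF1F)
$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";
$pattern = '/([\x{3002}\x{FF01}\x{FF1F}])/u';
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY; // 2 | 1 === 3

// As in the question: the flags land in the $limit slot, so this means
// "split into at most 3 pieces, with no flags set".
$asLimit = preg_split($pattern, $str, $flags);

// Flags passed as the fourth argument; -1 means "no limit".
$asFlags = preg_split($pattern, $str, -1, $flags);

var_dump(count($asLimit)); // int(3)
var_dump(count($asFlags)); // int(8)
```

With the flags in the right slot, each delimiter comes back as its own array element (text, delimiter, text, delimiter, and so on), which is why the pattern above captures the sentence and its delimiter as a single group instead.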
    
    This answer was selected as the best answer by the asker.
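
If the trailing space kept in element [1] is unwanted, one possible alternative is to split at the zero-width position after each delimiter using a lookbehind, and let the split pattern consume any whitespace that follows. This is a sketch of a different technique from the accepted pattern, not a correction of it; `$sentences` is an illustrative name:

```php
<?php
$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";

// Split right after 。 (U+3002), ！ (U+FF01) or ？ (U+FF1F). The lookbehind
// keeps the delimiter attached to the preceding sentence, and \s* swallows
// whitespace between sentences so it lands in neither piece.
$sentences = preg_split(
    '/(?<=[\x{3002}\x{FF01}\x{FF1F}])\s*/u',
    $str,
    -1,                  // no limit
    PREG_SPLIT_NO_EMPTY  // drop the empty piece after the final delimiter
);

print_r($sentences);
```

No capture group is needed here, so `PREG_SPLIT_DELIM_CAPTURE` is left out; the delimiters stay in the output because the match itself never includes them.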