preg_split in Unicode mode: PREG_SPLIT_DELIM_CAPTURE not working?

I'm trying to use a regex to split a chunk of Chinese text into sentences. For my purposes, sentence delimiters are:

the fullwidth full stop 。(0x3002)
the fullwidth question mark ？(0xFF1F)
the fullwidth exclamation mark ！(0xFF01)

Now, let's say my $str is this:

$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";

I use preg_split with these parameters:

$str2 = preg_split("/([\x{3002}\x{FF01}\x{FF1F}])/u",$str,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

$str2 is now an array that looks like this:

array(3) {
  [0]=> string(6)  "你好"
  [1]=> string(9)  "你好吗"
  [2]=> string(91) " 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！"
}

However, the expected output is:

[0] "你好。" 
[1] "你好吗？"
[2] "我是程序员，不太懂这个我问题，希望大家能够帮忙！"
[3] "一起加油吧！"

As you can see, there are two problems: first, the text is not split at the exclamation marks; second, the fullwidth full stop and the fullwidth question mark vanish, although I'd expect PREG_SPLIT_DELIM_CAPTURE to keep them. I've been staring at this code for so long that I can no longer see what the problem is. I would very much appreciate suggestions.

2 answers

doucan8521 2012-02-02 15:29

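preg_split() takes $limit as its third argument and $flags as its fourth. In your call the flags land in the $limit slot: PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY evaluates to 3, so the string is split into at most 3 pieces and no flags are applied. To capture each sentence together with its delimiter, pass a limit explicitly and use a regex like this: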

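$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";
// The capture group matches a whole sentence: a run of non-delimiter characters,
// one delimiter, and any trailing whitespace. The third argument (0) means "no limit".
$arr = preg_split("/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u",
                  $str, 0, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY );
var_dump($arr);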

OUTPUT:

 array(4) {
  [0]=> string(9)  "你好。"
  [1]=> string(13) "你好吗？ "
  [2]=> string(72) "我是程序员，不太懂这个我问题，希望大家能够帮忙！"
  [3]=> string(18) "一起加油吧！"
}
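
For illustration, here is a minimal sketch of how the argument position changes the result, using the question's original pattern. It assumes the input contains the ASCII space after the question mark, as in the $str above; the variable names $asPosted, $parts and $sentences are made up for this example.

```php
<?php
// Same text as in the answer above (note the ASCII space after the question mark).
$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";

// The question's original pattern: a single captured delimiter character.
$pattern = '/([\x{3002}\x{FF01}\x{FF1F}])/u';

// As posted in the question: the flags sit in the $limit slot (value 3),
// so at most 3 pieces come back and no flag is in effect.
$asPosted = preg_split($pattern, $str, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// With an explicit limit of -1 (no limit) the flags reach the $flags parameter,
// and each delimiter comes back as its own array element.
$parts = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Glue each text piece to the delimiter that follows it, then trim whitespace.
// Assumption: text and delimiters strictly alternate (no doubled marks such as "！！").
$sentences = array_map('trim', array_map('implode', array_chunk($parts, 2)));

print_r($sentences);
```

Here $asPosted holds the three elements shown in the question, while $sentences holds the four sentences the question expects. Pairing pieces with array_chunk() is only one way to reassemble them and relies on the alternation assumption noted in the comment.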

This answer was selected by the asker as the best answer.
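
A possible alternative, sketched here under the assumption that whitespace between sentences should be dropped: split on the empty position right after a delimiter by using a lookbehind, so that no capture group is needed and each delimiter stays attached to its sentence. The function name split_sentences is hypothetical and not part of the original answer.

```php
<?php
/**
 * Split Chinese text into sentences, keeping the closing punctuation.
 * Hypothetical helper for illustration only.
 */
function split_sentences(string $text): array
{
    // (?<=...) is a lookbehind: the split point must directly follow one of
    // 。 (U+3002), ！ (U+FF01) or ？ (U+FF1F). The \s* swallows any whitespace
    // after the delimiter, so it does not end up in the next sentence.
    $pattern = '/(?<=[\x{3002}\x{FF01}\x{FF1F}])\s*/u';

    // -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops the empty piece that
    // would otherwise follow the final delimiter.
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";
print_r(split_sentences($str));
```

Because the delimiter is only looked at and not consumed, the question mark in the second sentence is kept without the trailing space that appears in the output above.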

Question tagged php, dated 2012-02-02 14:36.