Declaration to make a PHP script completely Unicode-friendly

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.

The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.

Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):

1. The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).
2. All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
3. All functions with Unicode versions use those instead (eg, Collator::sort for sort).
4. All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
5. All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
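
To make the gap behind the byte-function and regex requirements concrete, here is a minimal sketch of how the byte and character functions disagree on the same UTF-8 string. It assumes the mbstring extension is loaded; the sample word and the variable names are only illustrative.

```php
<?php
// Assumption: mbstring is loaded. The string is written with byte escapes
// so the result does not depend on how this file itself is encoded.
$word = "na\xC3\xAFve"; // "naive" with a diaeresis on the i: 5 characters, 6 bytes

// Byte-oriented functions count and slice bytes.
$byteLength = strlen($word);        // 6
$byteSlice  = substr($word, 0, 3);  // "na\xC3": cut in the middle of a character

// Character-oriented counterparts count and slice code points.
$charLength = mb_strlen($word, 'UTF-8');        // 5
$charSlice  = mb_substr($word, 0, 3, 'UTF-8');  // "na\xC3\xAF"

// Without /u, "." matches one byte; with /u it matches one code point.
$fiveBytes      = preg_match('/^.{5}$/', $word);   // 0, the subject has 6 bytes
$fiveCodePoints = preg_match('/^.{5}$/u', $word);  // 1

// Unicode property classes such as \p{L} are meant to be used with /u.
$allLetters = preg_match('/^\p{L}+$/u', $word);    // 1

var_dump($byteLength, $charLength, $fiveBytes, $fiveCodePoints, $allLetters);
```

Every call site has to opt in explicitly (the `mb_` prefix, the encoding argument, the `/u` modifier), which is exactly the clutter the requested declaration is supposed to remove.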

For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).
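
As an illustration of what grapheme mode would change, the sketch below compares the three levels (bytes, code points, grapheme clusters) on one string. It assumes the intl and mbstring extensions are loaded; the variable names are made up for the example.

```php
<?php
// Assumption: the intl and mbstring extensions are loaded.
// "e" followed by U+0301 COMBINING ACUTE ACCENT:
// one grapheme cluster, two code points, three bytes.
$eAcute = "e\xCC\x81";
$text   = $eAcute . "t" . $eAcute; // displays as a three-letter word

$bytes      = strlen($text);              // 7
$codePoints = mb_strlen($text, 'UTF-8');  // 5
$graphemes  = grapheme_strlen($text);     // 3

// grapheme_substr() never splits a base letter from its combining mark.
$first = grapheme_substr($text, 0, 1);    // "e\xCC\x81"

// In PCRE, \X matches one extended grapheme cluster.
// "." and "[^abc]" still match a single code point, even with /u.
$clusterCount = preg_match_all('/\X/u', $text, $matches); // 3

var_dump($bytes, $codePoints, $graphemes, $clusterCount);
```

So grapheme-aware code is possible today, but only by spelling out `grapheme_*` and `\X` by hand rather than by flipping a single switch.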

dongtu7567 2011-04-23 15:43
That full-Unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.

So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
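
For the normalization and collation points of the question, "using the right functions" could look roughly like the sketch below. It assumes the intl extension is loaded; the locale `en_US` and the sample words are arbitrary choices made for the example.

```php
<?php
// Assumption: the intl extension is loaded.
$composed   = "\xC3\xA9";   // U+00E9, a single precomposed code point
$decomposed = "e\xCC\x81";  // U+0065 followed by U+0301

// The two spellings render identically but are different byte strings.
$sameBytes = ($composed === $decomposed); // false

// Normalizing both to one form makes them comparable.
$nfd = Normalizer::normalize($composed, Normalizer::FORM_D);
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);

$words = ["zebra", "\xC3\xA9clair", "apple"];

// sort() compares bytes, so the accented word lands after "zebra".
$bytewise = $words;
sort($bytewise);

// Collator::sort() applies the collation rules of the given locale.
$collated = $words;
$collator = new Collator('en_US'); // locale picked only for illustration
$collator->sort($collated);

var_dump($sameBytes, $nfd === $decomposed, $nfc === $composed, $bytewise, $collated);
```

Input would be normalized once at the boundary and output once before it is sent, rather than at every comparison.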

One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.
For example, mb_substr() is called instead of substr() if function overloading is enabled.
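
A rough sketch of how this plays out in practice follows. Note that `mbstring.func_overload` can only be set in php.ini, was deprecated in PHP 7.2.0, and was removed in PHP 8.0.0, so the sketch also shows an explicit wrapper that does not depend on it. The helper name `u_substr` is hypothetical, and the code assumes mbstring is loaded and PHP 7.1 or later.

```php
<?php
// Assumption: mbstring is loaded. On PHP versions that still support it,
// the php.ini line would be:
//   mbstring.func_overload = 2   ; 2 = overload the string functions
// ini_get() returns false when the directive does not exist (PHP 8+),
// which the cast turns into 0.
$overload = (int) ini_get('mbstring.func_overload');
$stringFunctionsOverloaded = ($overload & 2) === 2;

// Hypothetical helper: the explicit equivalent of what overloading did
// for substr(), independent of any php.ini setting.
function u_substr(string $text, int $start, ?int $length = null): string
{
    return mb_substr($text, $start, $length, 'UTF-8');
}

$text  = "\xC3\xA9t\xC3\xA9"; // three characters, five bytes
$slice = u_substr($text, 0, 2); // first two characters: "\xC3\xA9t"

var_dump($stringFunctionsOverloaded, $slice);
```

The wrapper is more typing than the overload, but it behaves the same on every PHP version and does not silently change what `substr()` means for third-party code.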

This answer was selected by the asker as the best answer.
