Special ä ö characters break UTF-8 encoding

A user on my site entered special characters into a text field: ä ö

These are apparently not the same ä ö characters I can type on my keyboard, because when I paste them into Programmer's Notepad, each one splits into two: a¨ o¨

On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.

The character splitting happens there too, so I get a normal letter a or o followed by a weird lone xCC character. That breaks the UTF-8 encoding of the string, and the json_encode function fails as a result.

What would be the best way to handle these characters? Should I try to replace the special ä ö characters with the regular ones, or can I somehow catch the broken UTF-8 characters and remove or replace them?

douniewei6346 2019-02-28 13:37
It's not that these characters have broken the encoding, it's just that Unicode is really complicated.

Commonly used accented letters have their own code points in the Unicode standard, in this case:

U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"

U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"

However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:

U+0308 "COMBINING DIAERESIS"

When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
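To make the difference concrete, here is a small sketch that builds both spellings of "ä" and prints their UTF-8 bytes. It assumes PHP 7 or later for the `"\u{...}"` escape syntax; the variable names are only illustrative. Note that 0xCC is the first byte of the two-byte encoding of U+0308, which is probably the lone "xCC" the question describes.

```php
<?php
// Two spellings of the same visible letter "ä".
$precomposed = "\u{00E4}";   // single code point U+00E4
$decomposed  = "a\u{0308}";  // "a" followed by U+0308 COMBINING DIAERESIS

echo bin2hex($precomposed), "\n"; // c3a4
echo bin2hex($decomposed), "\n";  // 61cc88

// The strings look alike on screen but are not byte-identical.
var_dump($precomposed === $decomposed); // bool(false)

// Both are valid UTF-8: an empty pattern with the /u modifier
// only matches if the subject is well-formed UTF-8.
var_dump(preg_match('//u', $decomposed) === 1); // bool(true)
```

So the decomposed input is valid as long as the bytes 0xCC 0x88 stay together; the string only becomes invalid if something cuts that sequence in half.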

As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in Unicode Standard Annex #15:

Normalization Form D (NFD): Canonical Decomposition

Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition

Normalization Form KD (NFKD): Compatibility Decomposition

Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

Ignoring the "Compatibility" forms for now, we have two options:

Decomposition, which uses combining diacritics as often as possible

Composition, which uses specific code points as often as possible

So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
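A minimal sketch of that conversion, assuming the intl extension is installed. `toNfc` is a hypothetical helper name, not part of PHP; `Normalizer::normalize` and `Normalizer::FORM_C` are the real intl API.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.

// Hypothetical helper: convert a string to NFC or fail loudly.
function toNfc(string $input): string
{
    $normalized = Normalizer::normalize($input, Normalizer::FORM_C);
    // normalize() returns false when normalization fails.
    if ($normalized === false) {
        throw new InvalidArgumentException('Normalization failed');
    }
    return $normalized;
}

$decomposed = "a\u{0308}o\u{0308}"; // "äö" written with combining diaereses
$composed   = toNfc($decomposed);

var_dump($composed === "\u{00E4}\u{00F6}"); // bool(true)
echo bin2hex($composed), "\n";              // c3a4c3b6
```

Normalizing once, as early as possible after receiving the input, means later comparisons and validation see a single spelling of each letter.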

However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
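As one possible approach, the validation pattern can be written with Unicode properties and the `/u` modifier. The allowed set below (letters, combining marks, digits, spaces) is only an assumed example, and `findDisallowed` is a hypothetical helper name. Without `/u`, PCRE works byte by byte, so a negated character class can match one byte of a multi-byte sequence such as 0xCC 0x88 and split it, which would explain the broken string in the question.

```php
<?php
// Hypothetical helper: return every character outside the allowed set.
// Assumed rule: letters (\p{L}), combining marks (\p{M}),
// numbers (\p{N}) and plain spaces are allowed.
function findDisallowed(string $input): array
{
    // /u makes PCRE treat pattern and subject as UTF-8,
    // so matches are whole code points, never single bytes.
    $count = preg_match_all('/[^\p{L}\p{M}\p{N} ]/u', $input, $matches);
    if ($count === false) {
        // With /u, preg_match_all() fails on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $matches[0];
}

var_dump(findDisallowed("a\u{0308} o\u{0308} 1")); // empty array
var_dump(findDisallowed("a\u{0308}<b>"));          // "<" and ">"
```

Because `\p{M}` is in the allowed set, a decomposed "ä" passes validation intact, whether or not the input was normalized first.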

You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
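A short sketch of the difference between bytes, code points, and graphemes, assuming the intl extension for the `grapheme_*` functions. The sample string is the decomposed "äö" from the question.

```php
<?php
// Requires the intl extension for grapheme_strlen() and grapheme_substr().
$text = "a\u{0308}o\u{0308}"; // decomposed "äö"

// Bytes: a (1) + U+0308 (2) + o (1) + U+0308 (2).
echo strlen($text), "\n";                     // 6

// Code points: with /u and /s, "." matches one code point at a time.
echo preg_match_all('/./su', $text), "\n";    // 4

// Graphemes: what a reader perceives as characters.
echo grapheme_strlen($text), "\n";            // 2

// Extracting the first grapheme keeps the letter and its diacritic together.
$first = grapheme_substr($text, 0, 1);
echo bin2hex($first), "\n";                   // 61cc88
```

Cutting or wrapping text at grapheme boundaries rather than byte offsets avoids separating a base letter from its combining mark.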
This answer was selected by the asker as the best answer.

Question asked 2019-02-28 13:22. Tag: php