Special ä ö characters break UTF-8 encoding

A user on my site entered special characters into a text field: ä ö

These are apparently not the same ä ö characters I can type on my keyboard, because when I paste them into Programmer's Notepad, they split into two: a¨ o¨

On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.

The character splitting happens there too, so I get a normal letter a and o with a weird lone xCC character that breaks the UTF-8 string encoding, and the json_encode function fails as a result.

What would be the best way to handle these characters? Should I try to replace the special ä ö characters with the regular ones, or can I somehow catch the broken UTF-8 characters and remove or replace them?

douniewei6346 2019-02-28 05:37
It's not that these characters have broken the encoding, it's just that Unicode is really complicated.

Commonly used accented letters have their own code points in the Unicode standard, in this case:

U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"

U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"

However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:

U+0308 "COMBINING DIAERESIS"

When placed after the code point for a normal letter, these code points add a diacritic to that letter when it is displayed.
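
The difference between the two representations can be made visible by listing code points and UTF-8 bytes. The following is a minimal sketch in Python using only the standard library; the helper name `describe` is made up for illustration. The last part shows one plausible way a lone 0xCC byte could arise (cutting the two-byte sequence for U+0308 in half), which is only a guess at what happens in the asker's script.

```python
import unicodedata

precomposed = "\u00e4"   # ä as one code point
combining = "a\u0308"    # a followed by COMBINING DIAERESIS


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(describe(precomposed))
print(describe(combining))

# The two strings look alike on screen but differ in content and in bytes.
print(precomposed == combining)               # False
print(precomposed.encode("utf-8").hex(" "))   # c3 a4
print(combining.encode("utf-8").hex(" "))     # 61 cc 88

# Assumption: byte-oriented processing cut the sequence after 0xCC.
truncated = combining.encode("utf-8")[:2]     # b"a\xcc"
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```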

As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms" defined in an annex to the Unicode standard:

Normalization Form D (NFD): Canonical Decomposition

Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition

Normalization Form KD (NFKD): Compatibility Decomposition

Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

Ignoring the "Compatibility" forms for now, we have two options:

Decomposition, which uses combining diacritics as often as possible

Composition, which uses specific code points as often as possible

So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
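
As a rough illustration of what NFC and NFD do, here is a sketch using Python's standard `unicodedata` module. In PHP, the analogous call from the intl extension would be `Normalizer::normalize($input, Normalizer::FORM_C)`. The input string is an assumed stand-in for what the user pasted.

```python
import unicodedata

# Assumed input: letters followed by COMBINING DIAERESIS (decomposed form).
decomposed = "a\u0308 o\u0308"

# NFC composes each base letter and its mark into one code point where possible.
composed = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(ch):04X}" for ch in composed])   # ['U+00E4', 'U+0020', 'U+00F6']

# NFD goes the other way and restores the decomposed sequence.
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Normalizing once at the point where input enters the application means later comparisons and character checks only ever see one of the two representations.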

However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
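
A small sketch of both points, again in Python for illustration. The policy in `is_allowed` (letters, combining marks and plain spaces only) is a hypothetical example, not a recommendation. In PHP, a comparable check could be written with `preg_match` using the `\p{L}` and `\p{M}` property classes together with the `u` modifier.

```python
import unicodedata

# There is no precomposed "q with diaeresis", so NFC leaves two code points.
stubborn = unicodedata.normalize("NFC", "q\u0308")
print(len(stubborn))   # 2


def is_allowed(text):
    """Hypothetical policy: allow only letters, combining marks and spaces."""
    for ch in unicodedata.normalize("NFC", text):
        category = unicodedata.category(ch)   # e.g. "Ll", "Mn", "Sm"
        if not (category[0] in "LM" or ch == " "):
            return False
    return True


print(is_allowed("Ka\u0308se q\u0308"))   # True
print(is_allowed("a<b"))                  # False
```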

You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
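
To give a feel for the idea, the sketch below groups each base code point with the combining marks that follow it. This is a deliberately simplified approximation written for illustration, not the full segmentation algorithm from the Unicode standard (it ignores emoji sequences, Hangul syllables and similar cases). In PHP, the intl extension provides functions such as `grapheme_strlen` and `grapheme_substr` for this purpose.

```python
import unicodedata


def rough_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Simplified approximation: only general category M* is treated as
    "belongs to the previous cluster".
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "ha\u0308nd"                  # h, a, COMBINING DIAERESIS, n, d
print(len(word))                     # 5 code points
print(len(rough_graphemes(word)))    # 4 user-perceived characters
```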

This answer was selected as the best answer by the asker.
