duanhe6464
2019-02-28 13:22
Views: 516
Accepted

Special ä ö characters break UTF-8 encoding

A user on my site inputted special characters into a text field: ä ö

These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨

On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.

The character splitting happens there too, so I get a normal letter a and o followed by a weird lone \xCC byte that breaks the UTF-8 string encoding, and the json_encode function fails as a result.

What would be the best way to handle these characters? Should I try to replace the special ä ö characters with the regular ones, or can I somehow catch the broken UTF-8 characters and remove or replace them?

1 answer

  • douniewei6346 2019-02-28 13:37
    Accepted

    It's not that these characters have broken the encoding; it's just that Unicode is really complicated.

    Commonly used accented letters have their own code points in the Unicode standard, in this case:

    • U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
    • U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"

    However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:

    • U+0308 "COMBINING DIAERESIS"

    When placed after the code point for a normal letter, these code points add a diacritic to that letter when it is displayed.
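
    To make the two representations concrete, here is a small PHP sketch (assuming PHP 7.0 or later for the `\u{...}` escape syntax; the variable names are illustrative). It also shows where the 0xCC byte from the question plausibly comes from: U+0308 is encoded in UTF-8 as the two bytes CC 88, and any byte-oriented operation that separates them leaves an invalid string.

    ```php
    <?php
    $precomposed = "\u{00E4}";  // U+00E4, a single code point
    $decomposed  = "a\u{0308}"; // "a" followed by U+0308 COMBINING DIAERESIS

    echo bin2hex($precomposed), "\n"; // c3a4
    echo bin2hex($decomposed), "\n";  // 61cc88
    var_dump($precomposed === $decomposed); // bool(false): different byte sequences

    // Assumption: something split the combining mark's bytes, leaving a lone 0xCC.
    $broken = "a\xCC";

    var_dump(json_encode($decomposed) !== false); // bool(true): valid UTF-8 encodes fine
    var_dump(json_encode($broken));               // bool(false): malformed UTF-8
    ```

    Both strings render as "ä", but only a byte-level or code-point-level comparison reveals the difference. The decomposed form is perfectly valid UTF-8 as long as its bytes stay together.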

    As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in Unicode Standard Annex #15:

    • Normalization Form D (NFD): Canonical Decomposition
    • Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
    • Normalization Form KD (NFKD): Compatibility Decomposition
    • Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

    Ignoring the "Compatibility" forms for now, we have two options:

    • Decomposition, which uses combining diacritics as often as possible
    • Composition, which uses specific code points as often as possible

    So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
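
    A minimal sketch of that conversion, assuming the intl extension is loaded and that the user's input arrived in decomposed form (the `$input` value here is made up for illustration):

    ```php
    <?php
    // Requires the intl extension, which provides the Normalizer class.
    $input = "a\u{0308}o\u{0308}"; // decomposed "äö" (assumed user input)

    // NFC: combine base letters and combining marks into single code points where possible.
    $nfc = Normalizer::normalize($input, Normalizer::FORM_C);

    // NFD: the reverse direction, splitting precomposed letters apart.
    $nfd = Normalizer::normalize("\u{00E4}\u{00F6}", Normalizer::FORM_D);

    var_dump($nfc === "\u{00E4}\u{00F6}");                           // bool(true)
    var_dump($nfd === $input);                                       // bool(true)
    var_dump(Normalizer::isNormalized($input, Normalizer::FORM_C));  // bool(false)
    ```

    Normalizing once, as early as possible after receiving the input, means the rest of the script only ever sees one representation of each letter.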

    However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
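
    As a rough illustration of both points, the sketch below first shows a combination that NFC cannot compose, then filters input by Unicode property. The allow-list (letters, combining marks, digits, spaces) and the function name `highlightIllegal` are hypothetical choices for this example, not something taken from the question. The `/u` modifier matters here: it makes PCRE treat the subject as UTF-8, so multi-byte code points are matched whole rather than byte by byte.

    ```php
    <?php
    // Requires the intl extension for Normalizer; the regex part needs only PCRE.

    // "q" + U+0308 has no precomposed code point, so NFC leaves it decomposed.
    $qDiaeresis = "q\u{0308}";
    var_dump(Normalizer::normalize($qDiaeresis, Normalizer::FORM_C) === $qDiaeresis); // bool(true)

    /**
     * Hypothetical policy: letters, combining marks, digits and spaces are allowed;
     * every other run of characters is wrapped in <b> for the error message.
     */
    function highlightIllegal(string $input): string
    {
        $result = preg_replace_callback(
            '/[^\p{L}\p{M}\p{N} ]+/u',
            function (array $m): string {
                // Escape only the disallowed run; allowed characters need no escaping.
                return '<b>' . htmlspecialchars($m[0]) . '</b>';
            },
            $input
        );
        if ($result === null) {
            // With /u, PCRE refuses subjects that are not valid UTF-8.
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return $result;
    }

    echo highlightIllegal("a\u{0308}o\u{0308} <script>"), "\n";
    ```

    Because `\p{M}` is on the allow-list, decomposed letters pass through intact, and genuinely malformed input is rejected with an exception instead of being silently mangled.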

    You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
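
    The difference between bytes, code points, and graphemes can be seen with a short sketch like the following. It assumes the mbstring extension (for `mb_strlen`) and the intl extension (for `grapheme_strlen`) are both available; the `\X` pattern is PCRE's escape for one extended grapheme cluster.

    ```php
    <?php
    $s = "a\u{0308}"; // decomposed "ä": one grapheme, two code points, three bytes

    var_dump(strlen($s));             // int(3): bytes
    var_dump(mb_strlen($s, 'UTF-8')); // int(2): code points
    var_dump(grapheme_strlen($s));    // int(1): grapheme clusters

    // Split a string into user-perceived characters with \X.
    preg_match_all('/\X/u', "a\u{0308}o\u{0308}", $matches);
    var_dump(count($matches[0]));     // int(2)
    ```

    Iterating over `$matches[0]` instead of bytes or code points keeps each letter together with its diacritics, which is usually what a per-character highlighter wants.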
