duanhe6464 2019-02-28 13:22


A user on my site inputted special characters into a text field: ä ö

These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨

On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.

The character splitting happens there too, so I get a normal letter a and o, each followed by a weird lone xCC byte. That byte breaks the string's UTF-8 encoding, and json_encode fails as a result.

What would be the best way to handle these characters? Should I try to replace the special ä ö chars with the regular ones, or can I somehow catch the broken UTF-8 chars and remove or replace them?


1 answer

  • douniewei6346 2019-02-28 13:37

    It's not that these characters have broken the encoding, it's just that Unicode is really complicated.

    Commonly used accented letters have their own code points in the Unicode standard, in this case:

    • U+00E4 LATIN SMALL LETTER A WITH DIAERESIS (ä)
    • U+00F6 LATIN SMALL LETTER O WITH DIAERESIS (ö)

    However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:

    • U+0308 COMBINING DIAERESIS

    When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
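
    To make the difference concrete, here is a minimal sketch comparing the two spellings of ä at the byte level. The variable names are just for illustration. Note that the combining diaeresis encodes to the two bytes CC 88 in UTF-8, which would fit the lone xCC byte described in the question if something is cutting that sequence in half.

```php
<?php
// Precomposed form: a single code point, U+00E4.
$precomposed = "\u{00E4}";
// Decomposed form: plain "a" (U+0061) followed by COMBINING DIAERESIS (U+0308).
$decomposed = "a\u{0308}";

// Both display as "ä", but the underlying UTF-8 bytes differ.
$precomposedHex = bin2hex($precomposed); // c3a4
$decomposedHex  = bin2hex($decomposed);  // 61cc88

echo $precomposedHex, "\n";
echo $decomposedHex, "\n";

// A plain byte-wise comparison treats them as different strings.
var_dump($precomposed === $decomposed); // bool(false)
```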

    As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in Unicode Standard Annex #15:

    • Normalization Form D (NFD): Canonical Decomposition
    • Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
    • Normalization Form KD (NFKD): Compatibility Decomposition
    • Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

    Ignoring the "Compatibility" forms for now, we have two options:

    • Decomposition, which uses combining diacritics as often as possible
    • Composition, which uses specific code points as often as possible

    So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
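
    A rough sketch of what that could look like, assuming the intl extension is installed (the sample string and variable names are made up for the example):

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
$decomposed = "a\u{0308}o\u{0308}"; // "äö" written with combining diaereses

// NFC: combining sequences are replaced by precomposed code points where they exist.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);

// normalize() returns false on failure, for example on invalid UTF-8.
if ($nfc === false) {
    throw new RuntimeException('Input could not be normalized');
}

echo bin2hex($nfc), "\n"; // c3a4c3b6

// Going the other way, NFD splits precomposed letters back apart.
$nfd = Normalizer::normalize($nfc, Normalizer::FORM_D);
echo bin2hex($nfd), "\n"; // 61cc886fcc88
```

    Normalizing once, right where the input enters the script, means every later check sees a single spelling of each letter.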

    However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
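
    As an illustration of matching by character property, here is a sketch of a check that allows letters, combining marks, digits and spaces. The allowed set and the function name findDisallowed are assumptions for the example, not something from the question. The important part is probably the "u" modifier: with it, PCRE treats the pattern and the input as UTF-8, so a match can't land in the middle of a multi-byte sequence.

```php
<?php
// Hypothetical allow-list: letters (\p{L}), combining marks (\p{M}),
// digits (\p{N}) and the space character. Adjust to your own rules.
function findDisallowed(string $input): array
{
    // The "u" modifier switches PCRE to UTF-8 mode, so each match is a
    // whole code point rather than a single byte.
    $count = preg_match_all('/[^\p{L}\p{M}\p{N} ]/u', $input, $matches);
    if ($count === false) {
        // In UTF-8 mode, preg_match_all() fails on malformed input.
        throw new InvalidArgumentException('Pattern failed; input may not be valid UTF-8');
    }
    return $matches[0];
}

// Decomposed "ä ö" plus digits: everything is allowed.
$ok = findDisallowed("a\u{0308} o\u{0308} 42");

// Angle brackets are neither letters, marks, digits nor spaces.
$bad = findDisallowed("\u{00E4}<b>\u{00F6}");

var_dump($ok, $bad);
```

    Including \p{M} in the allowed set is what lets decomposed letters through even when no precomposed form exists for them.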

    You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
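
    A small sketch of the difference between counting bytes and counting graphemes. grapheme_strlen comes from the intl extension, and \X is the PCRE escape for one extended grapheme cluster. The sample string is made up for the example.

```php
<?php
// Decomposed "äö": two graphemes, four code points, six bytes.
$text = "a\u{0308}o\u{0308}";

$byteCount     = strlen($text);          // counts bytes
$graphemeCount = grapheme_strlen($text); // counts grapheme clusters (needs intl)

echo $byteCount, "\n";     // 6
echo $graphemeCount, "\n"; // 2

// \X with the "u" modifier matches one extended grapheme cluster at a time,
// so each letter stays together with its combining mark.
preg_match_all('/\X/u', $text, $m);
$clusters = $m[0];

foreach ($clusters as $cluster) {
    echo bin2hex($cluster), "\n"; // 61cc88, then 6fcc88
}
```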

    The asker selected this answer as the best answer.


