It's not that these characters have broken the encoding; it's just that Unicode is really complicated.
Commonly used accented letters have their own code points in the Unicode standard, in this case:
- U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
- U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"
However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:
- U+0308 "COMBINING DIAERESIS"
When placed after the code point for a normal letter, these code points add a diacritic to that letter when the text is displayed.
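To make the difference concrete, here is a small sketch comparing the two spellings of "ä" byte for byte. It assumes PHP 7 or later (for the `\u{...}` escape syntax) and UTF-8 strings; the variable names are just for illustration.

```php
<?php
// Two ways to write "ä": they look the same on screen,
// but they are different sequences of code points (and bytes).
$precomposed = "\u{00E4}";   // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
$decomposed  = "a\u{0308}";  // U+0061 "a" followed by U+0308 COMBINING DIAERESIS

var_dump($precomposed === $decomposed); // bool(false)

// UTF-8 bytes of each spelling, as hex
echo bin2hex($precomposed), "\n"; // c3a4
echo bin2hex($decomposed), "\n";  // 61cc88
```

Any byte-wise comparison, array key lookup, or database unique index will treat these as two different strings unless you normalise them first.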
As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in Unicode Standard Annex #15:
- Normalization Form D (NFD): Canonical Decomposition
- Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
- Normalization Form KD (NFKD): Compatibility Decomposition
- Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
Ignoring the "Compatibility" forms for now, we have two options:
- Decomposition, which uses combining diacritics as often as possible
- Composition, which uses specific code points as often as possible
So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
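As a minimal sketch, assuming the intl extension is loaded and the input is valid UTF-8, converting between the two canonical forms looks something like this:

```php
<?php
// Requires the intl extension.
$decomposed  = "a\u{0308}";  // "a" + COMBINING DIAERESIS
$precomposed = "\u{00E4}";   // single code point "ä"

// NFC: compose into specific code points where they exist
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $precomposed); // bool(true)

// NFD: decompose into base letter + combining diacritics
$nfd = Normalizer::normalize($precomposed, Normalizer::FORM_D);
var_dump($nfd === $decomposed); // bool(true)

// You can also check a string without converting it
var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)
```

Note that `Normalizer::normalize()` returns `false` on failure (for example on invalid UTF-8), so real input handling should check for that before using the result.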
However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to decide exactly which characters you want to allow, probably by matching Unicode character properties in a regular expression (for example, \p{L} for letters and \p{M} for combining marks).
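Here is one way that property matching might look with PCRE's `u` modifier. The rule "letters, each optionally followed by combining marks" is only an assumption for illustration, and `$lettersOnly` / `$lettersWithMarks` are made-up names; you'll need to decide what your own input should allow.

```php
<?php
// \p{L} = any letter, \p{M} = any combining mark; the /u modifier
// makes PCRE treat the pattern and subject as UTF-8.
$lettersOnly      = '/^\p{L}+$/u';
$lettersWithMarks = '/^(?:\p{L}\p{M}*)+$/u';

$composed   = "Bj\u{00F6}rn";   // "Björn" with precomposed ö
$decomposed = "Bjo\u{0308}rn";  // "Björn" with o + COMBINING DIAERESIS

var_dump(preg_match($lettersOnly, $composed));        // int(1)
var_dump(preg_match($lettersOnly, $decomposed));      // int(0): U+0308 is a mark, not a letter
var_dump(preg_match($lettersWithMarks, $decomposed)); // int(1)
var_dump(preg_match($lettersWithMarks, "Bj0rn"));     // int(0): digit rejected
```

Allowing `\p{M}*` after each letter means the check still works for combinations that have no precomposed code point.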
You might also want to learn about "grapheme clusters" and use the relevant PHP functions (such as grapheme_strlen and grapheme_substr, which are also part of the intl extension). A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
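As a rough illustration, assuming both the intl and mbstring extensions are available, the three common "length" functions give three different answers for the same decomposed string:

```php
<?php
// Requires the intl and mbstring extensions.
$s = "a\u{0308}o\u{0308}"; // "äö" written with combining diaereses

echo strlen($s), "\n";             // 6 (bytes in UTF-8)
echo mb_strlen($s, 'UTF-8'), "\n"; // 4 (code points)
echo grapheme_strlen($s), "\n";    // 2 (grapheme clusters)

// Slicing by grapheme keeps the letter and its diacritic together
$first = grapheme_substr($s, 0, 1);
var_dump($first === "a\u{0308}"); // bool(true)
```

If you truncate or count user-visible characters, the grapheme functions are usually the ones that match what a reader would expect.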