Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
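To make the list above concrete, here is a minimal sketch of the by-hand equivalents that such a declaration would have to make implicit. It assumes PHP 7.0 or later (for the "\u{...}" string escapes) with the mbstring and intl extensions loaded; the variable names and the en_US locale are illustrative choices, not part of any existing declaration.

```php
<?php
declare(strict_types=1);

// Assumptions: PHP 7.0+, mbstring and intl extensions loaded.
// All variable names below are illustrative.

$word = "cr\u{00E8}me"; // "crème" with a precomposed è: 5 code points, 6 UTF-8 bytes

// Byte function versus its character counterpart.
$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts code points

// The same pattern without and with /u: "." steps over bytes, then code points.
$dotsAsBytes = preg_match_all('/./', $word);
$dotsAsChars = preg_match_all('/./u', $word);

// Normalizing by hand: NFD on the way in, NFC on the way out.
$decomposed = Normalizer::normalize($word, Normalizer::FORM_D);
$recomposed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

// Byte-order sort versus locale-aware collation.
$byBytes = ['zebra', "\u{00E9}clair", 'apple'];
$byCollation = $byBytes;
sort($byBytes, SORT_STRING);                 // é sorts after z (lead byte 0xC3)
(new Collator('en_US'))->sort($byCollation); // é sorts next to e

printf("strlen=%d mb_strlen=%d\n", $byteLength, $charLength);
printf("dots without /u=%d, with /u=%d\n", $dotsAsBytes, $dotsAsChars);
printf("NFD code points=%d, NFC round trip ok=%s\n",
    mb_strlen($decomposed, 'UTF-8'),
    $recomposed === $word ? 'yes' : 'no');
printf("sort: %s\n", implode(', ', $byBytes));
printf("Collator::sort: %s\n", implode(', ', $byCollation));
```

Every one of these calls is an explicit opt-in at the call site, which is exactly the clutter the single declaration is supposed to remove.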
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, a dot (.), or even [^abc], matches a Unicode grapheme cluster no matter how many code points it contains, etc).
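As a rough illustration of the difference between the three levels, the sketch below counts the same string by bytes, code points, and grapheme clusters. It assumes PHP 7.0 or later with the mbstring and intl extensions loaded; the sample string and variable names are illustrative. Note that the regex side uses \X explicitly, since a plain dot under /u still matches one code point.

```php
<?php
declare(strict_types=1);

// Assumptions: PHP 7.0+, mbstring and intl extensions loaded.
// All variable names below are illustrative.

// "noël" spelled with e followed by COMBINING DIAERESIS (U+0308).
$name = "noe\u{0308}l";

$bytes      = strlen($name);             // byte mode
$codePoints = mb_strlen($name, 'UTF-8'); // character (code point) mode
$graphemes  = grapheme_strlen($name);    // grapheme mode

// Taking "the third character" splits the ë in code point mode
// but keeps it whole in grapheme mode.
$thirdCodePoint = mb_substr($name, 2, 1, 'UTF-8');
$thirdGrapheme  = grapheme_substr($name, 2, 1);

// Under /u a dot matches one code point; \X matches one grapheme cluster.
$dotMatches     = preg_match_all('/./u', $name);
$clusterMatches = preg_match_all('/\X/u', $name);

printf("bytes=%d code points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
printf("dot matches=%d, \\X matches=%d\n", $dotMatches, $clusterMatches);
```

In the hoped-for grapheme mode, the plain dot would behave the way \X does here, and substr would behave the way grapheme_substr does.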