To get UTF-8 in the $_POST array, you need to tell the browser to submit the form in UTF-8.
Generally the way to achieve this is to serve the page containing the form with an indicator that the page is encoded as UTF-8. Otherwise, the browser will guess which encoding is in use, and that guess probably won't be UTF-8. To indicate UTF-8, set the charset parameter of the Content-Type header (Content-Type: text/html; charset=utf-8) or include this in the <head>:
<meta charset="utf-8"/>
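As a rough illustration, a minimal page that declares UTF-8 both ways and echoes back the bytes it receives might look like the sketch below. The field name text and the single-file layout are assumptions made for the example, not something taken from the question.

```php
<?php
// Send the charset in the HTTP header; this must happen before any output.
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
</head>
<body>
<form method="post">
  <!-- "text" is a hypothetical field name used only in this sketch -->
  <input type="text" name="text"/>
  <input type="submit"/>
</form>
<?php
if (isset($_POST['text'])) {
    // Dump the raw bytes received. With UTF-8 in effect, 人 (U+4EBA)
    // should arrive as the three bytes e4 ba ba.
    echo htmlspecialchars(bin2hex($_POST['text']));
}
?>
</body>
</html>
```

If the hex dump shows 262332303135343b instead, the browser submitted the character reference rather than UTF-8 bytes.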
If you include the character 人 in a form field and the browser thinks the encoding is one (like cp1252 Western European) that does not include that character, it will give up and send an HTML character reference instead: &#20154;. This mangling is unhelpful, as you can't tell whether the original input was &#20154; or 人, but it's a historical browser quirk that we will now never get rid of.
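The ambiguity can be seen with a small sketch. The two received values are written out by hand here (no real form submission takes place), and the variable names are made up for the example.

```php
<?php
// What the server would receive when the browser assumes cp1252:
$received_when_user_typed_character = '&#20154;'; // 人 was replaced by a character reference
$received_when_user_typed_reference = '&#20154;'; // the literal ASCII text, sent unchanged

// Both inputs arrive as exactly the same bytes.
var_dump($received_when_user_typed_character === $received_when_user_typed_reference); // bool(true)

// Decoding the reference yields 人 (U+4EBA) in either case, so decoding
// on the server cannot recover which of the two the user actually typed.
$decoded = html_entity_decode($received_when_user_typed_character, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === "\u{4EBA}"); // bool(true)
```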
This is why you get 2600000023000000: the characters U+0026 and U+0023 are the leading &# of that mangled version. The rest of the string is 00 rather than the subsequent characters because base_convert works with floating-point numbers, and 0x2600000023000000000000000000000000000000000000000000000000 is far too large a number to retain precision.
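The effect can be reproduced with a sketch along these lines. The code from the question is not shown here, so the UTF-32LE conversion is an assumption about how the hex string was produced, and the mbstring extension is assumed to be available.

```php
<?php
// The mangled value the browser sent in place of 人.
$mangled = '&#20154;';

// Assumption: the code points were obtained by converting to UTF-32LE (needs mbstring).
$utf32le = mb_convert_encoding($mangled, 'UTF-8' === 'UTF-8' ? 'UTF-32LE' : 'UTF-32LE', 'UTF-8');

// bin2hex works byte by byte on the string, so every digit survives.
$hex = bin2hex($utf32le);
echo $hex, "\n";

// The first two code units are "&" (U+0026) and "#" (U+0023).
echo substr($hex, 0, 16), "\n"; // 2600000023000000

// base_convert goes through a float internally, so a 64-digit hex value
// cannot round-trip; this comparison should print bool(false).
var_dump(base_convert($hex, 16, 16) === $hex);
```

Keeping the value as a string (bin2hex, str_split) instead of passing it through base_convert avoids the precision loss altogether.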
If you are trying to convert UTF-8-encoded characters into numeric code points, try the uniord/unichr helper functions (these are not PHP built-ins).
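On PHP 7.2 or later with the mbstring extension, the built-in mb_ord and mb_chr functions do the same job as those helpers. A minimal sketch, assuming both of those conditions hold:

```php
<?php
// UTF-8 character to numeric code point (PHP 7.2+, mbstring).
$code_point = mb_ord("人", 'UTF-8');
printf("U+%04X\n", $code_point); // U+4EBA

// Numeric code point back to a UTF-8 string.
$character = mb_chr(0x4EBA, 'UTF-8');
echo $character, "\n"; // 人
```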