2014-03-19 17:27
UTF-8 issue with PHP json_decode

EDIT2: The issue was with how my Perl client was interpreting the output of PHP's json_encode, which by default writes non-ASCII characters as \uXXXX Unicode escapes. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.

I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API returns the value in a different character encoding, even though the value PHP receives from the database is proper UTF-8.

I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):

$val = array("Millán");
print json_encode($val)."\n";

According to the PHP documentation, string literals are encoded ... "in whatever fashion [they are] encoded in the script file."

Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):

$ grep ill test.php | od -An -t x1c
  24  76  61  6c  20  3d  20  61  72  72  61  79  28  22  4d  69
   $   v   a   l       =       a   r   r   a   y   (   "   M   i
  6c  6c  c3  a1  6e  22  29  3b  0a
   l   l 303 241   n   "   )   ;  \n

And here is the output from PHP:

$ php -f test.php | od -An -t x1c
  5b  22  4d  69  6c  6c  5c  75  30  30  65  31  6e  22  5d  0a
   [   "   M   i   l   l   \   u   0   0   e   1   n   "   ]  \n

json_encode has replaced the two UTF-8 bytes of the lower case a-acute (c3 a1) with the six-character "Unicode" escape sequence \u00e1.

How can I keep PHP/json_encode from switching the encoding of this variable?

EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
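A minimal sketch of what utf8_encode does to a string that is already UTF-8, which may help with the confusion in the EDIT above. Because the function assumes ISO-8859-1 input, it treats each of the bytes c3 and a1 as a separate character and re-encodes both. The variable names are illustrative, the bytes are spelled out with \x escapes so the result does not depend on how the file is saved, and whether this double encoding is what made the Perl client appear to work is an assumption rather than something the post confirms. Note that utf8_encode is deprecated as of PHP 8.2.

```php
<?php
// "Mill\xc3\xa1n" spells out the UTF-8 bytes of "Millán" explicitly.
$utf8 = "Mill\xc3\xa1n";

// utf8_encode() assumes ISO-8859-1 input: every input byte is treated as
// one character and re-encoded, so c3 a1 becomes c3 83 c2 a1.
$double = utf8_encode($utf8);

print bin2hex($utf8) . "\n";               // 4d696c6cc3a16e
print bin2hex($double) . "\n";             // 4d696c6cc383c2a16e

// json_encode now sees two characters (U+00C3 and U+00A1) instead of one.
print json_encode(array($double)) . "\n";  // ["Mill\u00c3\u00a1n"]
```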

  2014-03-19 19:05

    This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. These sequences do not refer to any byte representation in any UTF encoding; they reference the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
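The round trip described in this reply can be sketched in a few lines of PHP. This is an illustration under stated assumptions rather than code from the thread: the variable names are made up, the UTF-8 bytes are written with \x escapes so the result does not depend on the file's encoding, and the last step uses the JSON_UNESCAPED_UNICODE flag, which exists only in PHP 5.4 and later (so not in the 5.3.24 mentioned in the question).

```php
<?php
// Same value as in the question, with the UTF-8 bytes of "á" spelled out.
$val = array("Mill\xc3\xa1n");

// Default behaviour: non-ASCII characters become \uXXXX escapes,
// so the JSON text is pure ASCII.
$escaped = json_encode($val);
print $escaped . "\n";                 // ["Mill\u00e1n"]

// Decoding turns the escape back into the original UTF-8 bytes.
$decoded = json_decode($escaped, true);
print bin2hex($decoded[0]) . "\n";     // 4d696c6cc3a16e

// PHP 5.4+ only: keep the character as raw UTF-8 instead of escaping it.
$raw = json_encode($val, JSON_UNESCAPED_UNICODE);
print bin2hex($raw) . "\n";            // 5b224d696c6cc3a16e225d
```

Both JSON texts describe the same string, so a conforming parser on the receiving side should produce identical data from either one.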

