drqyxkzbs21968684 2014-03-19 17:27

PHP json_decode UTF-8 issue

EDIT2: The issue was with how my Perl client was interpreting the output of PHP's json_encode, which by default escapes non-ASCII characters as \uXXXX Unicode code point sequences. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.

I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API appears to return a different character encoding, even though the value PHP receives from the database is proper UTF-8.

I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):

<?php
$val = array("Millán");
print json_encode($val)."\n";

According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.

Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):

$ grep ill test.php | od -An -t x1c
  24  76  61  6c  20  3d  20  61  72  72  61  79  28  22  4d  69
   $   v   a   l       =       a   r   r   a   y   (   "   M   i
  6c  6c  c3  a1  6e  22  29  3b  0a
   l   l 303 241   n   "   )   ;  \n

And here is the output from PHP:

$ php -f test.php | od -An -t x1c
  5b  22  4d  69  6c  6c  5c  75  30  30  65  31  6e  22  5d  0a
   [   "   M   i   l   l   \   u   0   0   e   1   n   "   ]  \n

The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.

How can I keep PHP/json_encode from switching the encoding of this variable?

EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.

2 answers

  • doubengshao8872 2014-03-19 19:05

    This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. These sequences do not refer to a byte sequence in any UTF encoding; they refer to the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
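
The round trip described above can be checked with a short script. This is only a minimal sketch: it writes the á as explicit bytes (\xc3\xa1) so the result does not depend on how the script file is saved, and the variable names are made up for illustration. The last step uses the JSON_UNESCAPED_UNICODE flag, which exists only in PHP 5.4 and later, so it would not be available on the PHP 5.3.24 mentioned in the question.

```php
<?php
// "Millán" written as explicit UTF-8 bytes (c3 a1 = á), independent of file encoding.
$original = "Mill\xc3\xa1n";

// Default behaviour: non-ASCII characters become \uXXXX escapes, output is pure ASCII.
$escaped = json_encode(array($original));
echo $escaped, "\n";                      // ["Mill\u00e1n"]

// Decoding turns the escape back into the same UTF-8 bytes.
$decoded = json_decode($escaped, true);
var_dump($decoded[0] === $original);      // bool(true)
echo bin2hex($decoded[0]), "\n";          // 4d696c6cc3a16e

// PHP 5.4+ only: ask json_encode to leave the character unescaped.
$raw = json_encode(array($original), JSON_UNESCAPED_UNICODE);
echo bin2hex($raw), "\n";                 // 5b224d696c6cc3a16e225d
```

Both $escaped and $raw are valid JSON for the same string; a conforming parser should return identical data for either one.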

    This answer was selected by the asker as the best answer.


