doulu1325 2014-12-29 19:11

How can I convert this escaped UTF-8 string from an Amazon MWS response into proper UTF-8?

In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:

<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>

The name is supposed to be Ramírez. The accented character í is Unicode code point U+00ED, which UTF-8 encodes as the two bytes \xC3\xAD (see this chart for reference).
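To make the difference between the code point and its bytes concrete, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escape) and the mbstring extension; the variable name is only illustrative.

```php
<?php
// U+00ED (í), written as a Unicode escape so the file encoding does not matter.
$i_acute = "\u{ED}";

// UTF-8 encodes U+00ED as two bytes: 0xC3 0xAD.
echo bin2hex($i_acute), "\n";             // c3ad
echo strlen($i_acute), "\n";              // 2 (bytes)
echo mb_strlen($i_acute, 'UTF-8'), "\n";  // 1 (character)
```

So the XML above escapes the two UTF-8 bytes separately instead of escaping the single code point, which would have been written as &#xED;.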

However, PHP's SimpleXML mangles this string, transforming it into

Ramírez Jones

You can see the same result here because I simply pasted the string into the editor box (evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).

Now when this mangled string gets saved into MongoDB and then pulled back out, it becomes

RamÃ-­rez Jones

For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as Ramírez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!

Here is some example code to show this problem:

$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());

Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:

UTF-8
Ramírez Jones
RamA-rez Jones
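The mangled value can also be inspected byte by byte. The following is a minimal sketch, assuming the SimpleXML extension is available: each reference is decoded as its own code point (U+00C3 and U+00AD) and then stored as UTF-8, which gives four bytes where two were intended.

```php
<?php
$xml  = '<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$elem = new SimpleXMLElement($xml);
$name = (string) $elem->Name;

// "Ram" occupies bytes 0..2; the next four bytes hold the two decoded references:
// U+00C3 (Ã) is C3 83 in UTF-8, and U+00AD (soft hyphen) is C2 AD.
echo bin2hex(substr($name, 3, 4)), "\n"; // c383c2ad
```

U+00AD is the soft hyphen, which would explain a hyphen that shows up in some renderings and is invisible in others.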

How can we avoid this problem? It's really screwing things up.

EDIT:

Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off of the í).

REVISED FINAL SOLUTION:

It differs from the accepted answer below, but this was the most elegant solution I found. Just replace the last line of the example with this:

echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));

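This works because "&#xC3;&#xAD;" are HTML numeric character references, which html_entity_decode() turns back into characters. It yields the UTF-8 bytes of í only when html_entity_decode() decodes to a single-byte charset such as ISO-8859-1, which was its default before PHP 5.4.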
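Since the result of html_entity_decode() depends on its target charset, a hedged sketch of the same idea passes the charset explicitly instead of relying on the default. It assumes PHP 7 or later for the `\u{...}` escapes used in the comparisons; the variable names are illustrative.

```php
<?php
$xml = '<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';

// Target ISO-8859-1: each reference becomes one raw byte (0xC3, 0xAD),
// and that byte pair is exactly the UTF-8 encoding of í.
$as_bytes = html_entity_decode($xml, ENT_QUOTES, 'ISO-8859-1');
var_dump($as_bytes === "<Address><Name>Ram\u{ED}rez Jones</Name></Address>"); // bool(true)

// Target UTF-8: each reference becomes its own code point (U+00C3, U+00AD),
// which reproduces the mangled text rather than fixing it.
$as_code_points = html_entity_decode($xml, ENT_QUOTES, 'UTF-8');
var_dump($as_code_points === "<Address><Name>Ram\u{C3}\u{AD}rez Jones</Name></Address>"); // bool(true)
```

The first result can then be passed to iconv('UTF-8', 'ASCII//TRANSLIT', ...) as in the line above; what iconv produces for í depends on the platform's iconv implementation.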

ALTERNATE SOLUTION

Strangely, this also works:

$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name; 

2 answers

  • douxing9641 2014-12-29 19:49

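    SimpleXML decodes each hex character reference as one Unicode code point (&#xC3; becomes U+00C3, &#xAD; becomes U+00AD). It does not treat the references as raw bytes and then read the result as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.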

    // Replace every hex character reference with the single byte of that value.
    // chr() yields one byte only, so this is meant for references in the 00-FF range.
    function decode_hexentities($xml) {
      return
        preg_replace_callback(
          '~&#x([0-9a-fA-F]+);~i',
          function ($matches) { return chr(hexdec($matches[1])); },
          $xml
        );
    }

    $xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
    $xml = decode_hexentities($xml); // the name now contains the raw UTF-8 bytes C3 AD
    $elem = new SimpleXMLElement($xml);
    $bad_string = $elem->Name;
    echo mb_detect_encoding($bad_string)."\n";
    echo $elem->Name->__toString()."\n";
    echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
    

    results in:

    UTF-8
    Ramírez Jones
    Ramirez Jones
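A different approach, sketched here as an assumption rather than as part of the answer above, is to parse the XML unchanged and repair the extracted value afterwards. Converting the value from UTF-8 to ISO-8859-1 turns U+00C3 U+00AD back into the bytes C3 AD, which are the UTF-8 encoding of í. The helper name repair_double_encoded() is hypothetical, and the sketch assumes PHP 7 or later with the mbstring and SimpleXML extensions.

```php
<?php
function repair_double_encoded(string $value): string
{
    // Leave invalid input untouched.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Characters above U+00FF cannot be single bytes in disguise; leave such values alone.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $value)) {
        return $value;
    }
    // Map each code point U+0000..U+00FF back to the byte of the same value.
    $bytes = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    // Accept the result only if the recovered bytes form valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $value;
}

$elem = new SimpleXMLElement('<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>');
$name = repair_double_encoded((string) $elem->Name);
var_dump($name === "Ram\u{ED}rez Jones"); // bool(true)
```

Because the helper falls back to the original value whenever the recovered bytes are not valid UTF-8, a name that was already correct (for example one containing a real U+00ED) passes through unchanged. It also avoids decoding references such as &#x26; into raw markup before the XML is parsed.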
    
    Accepted by the asker as the best answer.