doulu1325 2014-12-29 11:11
Views: 204
Accepted

How do I convert this escaped UTF-8 string from an Amazon MWS response into proper UTF-8?

In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:

<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>

The name is supposed to be Ramírez. The accented character í is Unicode code point U+00ED, which UTF-8 encodes as the two bytes \xC3\xAD (see this chart for reference).
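
To make the code point versus byte distinction concrete, here is a minimal sketch (assuming PHP 7+ for the "\u{...}" escape and the iconv extension that the code below already relies on). The variable names are only illustrative.

```php
<?php
// U+00ED (í) written as a Unicode escape; PHP stores the string as UTF-8 bytes.
$char = "\u{ED}";

// Hex dump of the UTF-8 encoding: two bytes.
$utf8Hex = bin2hex($char);

// The same character in ISO-8859-1 is a single byte whose value equals the code point.
$latin1Hex = bin2hex(iconv('UTF-8', 'ISO-8859-1', $char));

echo "UTF-8 bytes:  $utf8Hex\n";   // c3ad
echo "Latin-1 byte: $latin1Hex\n"; // ed
```

So the pair C3 AD in the response looks like the raw UTF-8 bytes of í, each written out as its own hex reference.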

However, PHP's SimpleXML mangles this string, transforming it into

Ramírez Jones

(You can see this because I simply pasted the decoded string into the editor box here; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP.)

When this mangled string is saved to MongoDB and then read back, it becomes

RamÃ-­rez Jones

For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as Ramírez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!

Here is some example code to show this problem:

$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());

Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:

UTF-8
Ramírez Jones
RamA-rez Jones

How can we avoid this problem? It's really screwing things up.

EDIT:

Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off of the í).

REVISED FINAL SOLUTION:

It differs from the accepted answer below, but this was the most elegant solution I found. Just replace the last line of the example with this:

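echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($xml, ENT_QUOTES, 'ISO-8859-1'));

This works because "&#xC3;&#xAD;" are numeric character references that html_entity_decode() understands. With ISO-8859-1 as the target charset (the default before PHP 5.4, passed explicitly here), each reference is decoded to a single byte, and the bytes 0xC3 0xAD together are the UTF-8 encoding of í.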

ALTERNATE SOLUTION

Strangely, this also works:

$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name; 


  • douxing9641 2014-12-29 11:49

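    SimpleXML is behaving correctly here. It does not decode the hex entities into bytes and then read the result as UTF-8, because that's not how XML or UTF-8 actually works: a numeric character reference names a Unicode code point, not a byte, so &#xC3;&#xAD; means Ã (U+00C3) followed by a soft hyphen (U+00AD). Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.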

    // Turn each hex character reference into the single byte with that value.
    // Only meaningful for references up to FF that were meant as raw bytes.
    function decode_hexentities($xml) {
      return
        preg_replace_callback(
          '~&#x([0-9a-fA-F]+);~i',
          function ($matches) { return chr(hexdec($matches[1])); },
          $xml
        );
    }

    $xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
    $xml = decode_hexentities($xml);
    $elem = new SimpleXMLElement($xml);
    $bad_string = $elem->Name;
    echo mb_detect_encoding($bad_string)."\n";
    echo $elem->Name->__toString()."\n";
    echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
    

    results in:

    UTF-8
    Ramírez Jones
    Ramirez Jones
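
One caveat with decoding every hex reference to a byte: references such as &#x26; (an ampersand) or a correctly used &#xE9; would be turned into raw bytes too, which can break the XML or produce invalid UTF-8. A more defensive variant might only rewrite runs of references in the 80-FF range whose bytes form valid UTF-8. This is a rough sketch under that assumption; repair_utf8_byte_references is a hypothetical helper name, not part of PHP or of any Amazon library, and it needs the mbstring extension.

```php
<?php
/**
 * Repair runs of hex character references that actually spell out UTF-8 bytes
 * (e.g. "&#xC3;&#xAD;"), leaving every other reference untouched.
 * Hypothetical helper, written for illustration only.
 */
function repair_utf8_byte_references(string $xml): string
{
    return preg_replace_callback(
        // One or more consecutive two-digit references in the range 0x80-0xFF.
        '~(?:&#x[89a-f][0-9a-f];)+~i',
        function (array $m): string {
            // Collect the hex digits of each reference and pack them into raw bytes.
            preg_match_all('~&#x([0-9a-f]{2});~i', $m[0], $parts);
            $bytes = hex2bin(implode('', $parts[1]));
            // Substitute only when the bytes are valid UTF-8; otherwise keep the original text.
            return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $m[0];
        },
        $xml
    );
}

// Sample input mixing byte-style references with ordinary, correct ones.
$xml  = '<Address><Name>Ram&#xC3;&#xAD;rez &amp; Jos&#xE9; &#x26; Jones</Name></Address>';
$elem = new SimpleXMLElement(repair_utf8_byte_references($xml));
$name = (string) $elem->Name;
echo $name, "\n";
```

Here only &#xC3;&#xAD; is rewritten; the lone &#xE9; is not valid UTF-8 as a byte, so it is left for the XML parser to resolve as the code point é. The check is a heuristic: a genuine "Ã" followed by another Latin-1 character written as references would be rewritten as well. The resulting $name can then be passed to iconv('UTF-8', 'ASCII//TRANSLIT', ...) as above, though transliteration output varies between iconv implementations.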
    
    This answer was selected by the asker as the best answer.