duangan7834 2018-05-23 21:35

Unexpected result when decoding Unicode escape sequences

I have this black box that spits out JSON, and the file comes with what I assume are escaped Unicode characters. Here's a snippet:

{
    "AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."
}

Now, here's what the resulting JSON should actually look like to any reasonable human being:

{
    "AR_DESCRI":"LIMA CENTIMETRADA/FORMAS UÑAS 100/180 MANI."
}

The most important thing there is that \u00c3\u2018 should equal the Ñ character.

However, as you can check with any Unicode escape sequence decoder, this is not the case: the output for \u00c3\u2018 is actually Ã‘, which is basically random noise.

I've tried some online decoders and I've also used PHP's json_decode() function, since PHP is the environment I'm currently working in. Both give me the same result. Here's the snippet of code if you are curious:

<?php
$json = '{"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."}';
print_r(json_decode($json));

//Output: stdClass Object ( [AR_DESCRI] => LIMA CENTIMETRADA/FORMAS UÃ‘AS 100/180 MANI. )

So my question is: why on earth does this happen? Is it an encoding issue on the black box's side, or am I using the wrong function?

Thanks in advance.


  • douyong7199 2018-05-24 01:09

    Ñ is U+00D1, represented in UTF-8 as the literal bytes \xc3\x91.

    What you've got there is mojibake, caused by incorrectly forcing a cp1252-to-UTF-8 conversion on an input string that was already UTF-8: in cp1252, \xc3 is Ã and \x91 is ‘ (left single quote).

    These two characters are then written out as their Unicode escapes, which is the \u00c3\u2018 you see.
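
    The corruption can be reproduced in a few lines. This is only a sketch, assuming the generator reads UTF-8 bytes as cp1252 and re-encodes them before writing the JSON; what the black box really does internally is unknown, and the variable names are made up for illustration.

```php
<?php
// Assumption: the generator treats UTF-8 bytes as cp1252 text and converts
// them to UTF-8 a second time. Variable names here are illustrative only.
$original = "\u{00D1}";   // "Ñ", stored as the two bytes C3 91 in UTF-8

// The buggy step: bytes C3 and 91 are each read as a cp1252 character.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'CP1252');

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$escaped = json_encode($mojibake);

echo bin2hex($original), "\n";   // c391
echo $escaped, "\n";             // "\u00c3\u2018"
```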

    Proof:

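    // List the code points of a UTF-8 string as "U+xxxx" (mb_ord() needs PHP >= 7.2).
    function ordify($str) {
        return implode(' ', array_map(
            function($a){return sprintf('U+%04x', mb_ord($a));},
            // -1 means "no limit"; passing null is deprecated since PHP 8.1
            preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
        ));
    }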
    
    $borked = 'Ã‘';
    $fixed  = mb_convert_encoding($borked, 'cp1252', 'utf-8');
    
    var_dump(
        $borked, ordify($borked),
        $fixed,  ordify($fixed)
    );
    

    Output:

    string(5) "Ã‘"
    string(13) "U+00c3 U+2018"
    string(2) "Ñ"
    string(6) "U+00d1"
    

    So go fix the thing that's generating your JSON, because any reasonable human being should value producing valid data in the first place over kludging in a bandaid solution.
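
    If the generator cannot be fixed right away, the same conversion can be applied to the decoded value as a stopgap. The helper below is hypothetical (repair_mojibake is not a PHP built-in) and is a heuristic sketch: it assumes exactly one round of cp1252 double encoding and leaves anything else untouched.

```php
<?php
// Hypothetical helper: undo one round of cp1252-to-UTF-8 double encoding,
// and return the string unchanged if it does not look double-encoded.
function repair_mojibake(string $s): string
{
    // Map each character back to a single cp1252 byte.
    $candidate = mb_convert_encoding($s, 'CP1252', 'UTF-8');

    // Accept the result only if something changed, nothing was lost in the
    // conversion (round trip matches), and the bytes form valid UTF-8.
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'CP1252') === $s;
    if ($candidate !== $s && $lossless && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $s;
}

$json = '{"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."}';
$data = json_decode($json, true);
$data['AR_DESCRI'] = repair_mojibake($data['AR_DESCRI']);

echo $data['AR_DESCRI'], "\n";   // LIMA CENTIMETRADA/FORMAS UÑAS 100/180 MANI.
```

    A correctly encoded Ñ or plain ASCII passes through unchanged, but text that legitimately contains a sequence such as Ã‘ would be rewritten as well, so this is a bandage and not a replacement for fixing the source.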
