dongse5408
2018-05-30 18:07
Views: 408
Accepted

PHP json_encode() returns "Malformed UTF-8 characters, possibly incorrectly encoded" (error)

I cannot solve this issue and it is driving me crazy.

json_encode() fails with the error "Malformed UTF-8 characters, possibly incorrectly encoded" on a few records (2 or 3) out of a set of 10k records. So far it has seemed impossible to fix.

  • MySQL is already utf8mb4 everywhere (database, tables, columns, and collation)
  • PHP is 7.2 and configured for UTF-8
  • Apache's default charset is UTF-8 (however, the error is thrown at the PHP level).

I can also print the record correctly to the screen from PHP in a simple HTML debug page. However, if I try to encode it as JSON, I get the error.

I found that these records were imported from a CSV file, probably bypassing the cleaner. What is strange is that the entire CSV file is parsed with:

$this->encoding = mb_detect_encoding($source,mb_detect_order(),true);
if ($this->encoding!="" && $this->encoding!="UTF8") {
    $source = iconv($this->encoding, "UTF-8", $source);
} 

I cannot post the full broken data because of privacy (and GDPR). However, I managed to extract the part that seems to be broken:

RESIDENCE �PRINCIPE

UPDATE

I tried to get the byte values of these broken characters. This is what I found. Reading the string byte by byte with the native functions str_split and ord, the character is:

'�' 160

I would also like to find the code point in UTF-8, so I found this useful function on PHP.net, http://php.net/manual/en/function.ord.php#109812, which tries to compute the code point of multibyte strings. It gives me:

-2096

Which is....... negative?
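
Both numbers can be reproduced from the raw bytes. The sketch below assumes the broken field contains the single byte 0xA0 (decimal 160, a no-break space in ISO-8859-1 and Windows-1252), and that the linked helper treats it as the lead byte of a two-byte UTF-8 sequence followed by the letter P. The sample string and variable names are made up for illustration.

```php
<?php
// Hypothetical sample: the byte 0xA0 embedded in otherwise plain ASCII text.
$broken = "RESIDENCE\xA0PRINCIPE";

// A lone 0xA0 is a continuation byte with no lead byte, so this is not valid UTF-8.
$isValidUtf8 = mb_check_encoding($broken, 'UTF-8');

// A hex dump of the raw bytes makes the offending byte visible.
$hex = bin2hex($broken);

// ord() works on single bytes, which is why it reports 160 for the byte at offset 9.
$byte = ord($broken[9]);

// Decoding 0xA0 as if it were a two-byte lead (110xxxxx) followed by 'P' (80)
// computes (160 - 192) * 64 + (80 - 128), which is negative.
$misdecoded = ($byte - 192) * 64 + (ord($broken[10]) - 128);

var_dump($isValidUtf8, $hex, $byte, $misdecoded);
```

Under these assumptions the negative value is a sign that the input is not UTF-8 at all, not a real code point.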


2 answers

  • dsqe46004 2018-05-31 09:32
    Accepted answer

    SOLVED!

    The issue was in the function mb_detect_order(): it does not work the way I expected. I thought it returned the full list of supported encodings, ordered by how commonly they are used, to speed up the detection process.

    But I just found that this function returns only 2 encodings:

    //print_r(mb_detect_order());
    Array
    (
        [0] => ASCII
        [1] => UTF-8
    )
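
To see why that two-entry order matters, here is a minimal sketch, assuming the problem field contains a stray 0xA0 byte as in the question (the sample string is hypothetical). In strict mode neither ASCII nor UTF-8 accepts that byte, so detection returns false, and the question's `!= ""` check then skips the conversion entirely.

```php
<?php
// Hypothetical sample with a lone 0xA0 byte (valid in ISO-8859-1, invalid in UTF-8).
$source = "RESIDENCE\xA0PRINCIPE";

// The detection order reported above, written out explicitly.
$defaultOrder = ['ASCII', 'UTF-8'];

// Strict mode: no candidate accepts byte 0xA0, so the result is false.
$encoding = mb_detect_encoding($source, $defaultOrder, true);

// Loose comparison treats false as equal to "", so this guard evaluates to false
// and the iconv() call from the question would never run.
$wouldConvert = ($encoding != "" && $encoding != "UTF-8");

var_dump($encoding, $wouldConvert);
```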
    

    Which is almost completely useless in my case. The mb functions can detect many more charsets. You can check them by running mb_list_encodings() to get the full list:

    //print_r(mb_list_encodings());
    Array
    (
        [0] => pass
        [1] => auto
        [2] => wchar
        [3] => byte2be
        [4] => byte2le
        [5] => byte4be
        [6] => byte4le
        [7] => BASE64
        [8] => UUENCODE
        [9] => HTML-ENTITIES
        [10] => Quoted-Printable
        [11] => 7bit
        [12] => 8bit
        [13] => UCS-4
        [14] => UCS-4BE
        [15] => UCS-4LE
        [16] => UCS-2
        [17] => UCS-2BE
        [18] => UCS-2LE
        [19] => UTF-32
        [20] => UTF-32BE
        [21] => UTF-32LE
        [22] => UTF-16
        [23] => UTF-16BE
        [24] => UTF-16LE
        [25] => UTF-8
        [26] => UTF-7
        [27] => UTF7-IMAP
        [28] => ASCII
        [29] => EUC-JP
        [30] => SJIS
        [31] => eucJP-win
        [32] => EUC-JP-2004
        [33] => SJIS-win
        [34] => SJIS-Mobile#DOCOMO
        [35] => SJIS-Mobile#KDDI
        [36] => SJIS-Mobile#SOFTBANK
        [37] => SJIS-mac
        [38] => SJIS-2004
        [39] => UTF-8-Mobile#DOCOMO
        [40] => UTF-8-Mobile#KDDI-A
        [41] => UTF-8-Mobile#KDDI-B
        [42] => UTF-8-Mobile#SOFTBANK
        [43] => CP932
        [44] => CP51932
        [45] => JIS
        [46] => ISO-2022-JP
        [47] => ISO-2022-JP-MS
        [48] => GB18030
        [49] => Windows-1252
        [50] => Windows-1254
        [51] => ISO-8859-1
        [52] => ISO-8859-2
        [53] => ISO-8859-3
        [54] => ISO-8859-4
        [55] => ISO-8859-5
        [56] => ISO-8859-6
        [57] => ISO-8859-7
        [58] => ISO-8859-8
        [59] => ISO-8859-9
        [60] => ISO-8859-10
        [61] => ISO-8859-13
        [62] => ISO-8859-14
        [63] => ISO-8859-15
        [64] => ISO-8859-16
        [65] => EUC-CN
        [66] => CP936
        [67] => HZ
        [68] => EUC-TW
        [69] => BIG-5
        [70] => CP950
        [71] => EUC-KR
        [72] => UHC
        [73] => ISO-2022-KR
        [74] => Windows-1251
        [75] => CP866
        [76] => KOI8-R
        [77] => KOI8-U
        [78] => ArmSCII-8
        [79] => CP850
        [80] => JIS-ms
        [81] => ISO-2022-JP-2004
        [82] => ISO-2022-JP-MOBILE#KDDI
        [83] => CP50220
        [84] => CP50220raw
        [85] => CP50221
        [86] => CP50222
    )
    

    I was wrong in thinking that mb_detect_order() was just an ordered version of this list. For this purpose the default mb_detect_order() is of little use. To convert to UTF-8 correctly, pass an explicit list of candidate encodings to the detection step:

    $my_encoding_list = [
        "UTF-8",
        "UTF-7",
        "UTF-16",
        "UTF-32",
        "ISO-8859-16",
        "ISO-8859-15",
        "ISO-8859-10",
        "ISO-8859-1",
        "Windows-1254",
        "Windows-1252",
        "Windows-1251",
        "ASCII",
        // add your preferred encodings here
    ];
    
    //remove unsupported encodings
    $encoding_list = array_intersect($my_encoding_list, mb_list_encodings());
    
    // finally, detect the encoding against the explicit list
    $this->encoding = mb_detect_encoding($source,$encoding_list,true);
    

    This worked and solved my issue with bad data saved in the database.
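
As a rough end-to-end check of this approach, the sketch below runs detection, conversion, and JSON encoding on a hypothetical sample string. It assumes a deliberately short candidate list and uses mb_convert_encoding() for the conversion step instead of the iconv() call from the question. Single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so the order of the list decides which one is reported.

```php
<?php
// Hypothetical sample: a lone 0xA0 byte inside ASCII text.
$source = "RESIDENCE\xA0PRINCIPE";

// Short illustrative candidate list; UTF-8 comes first so valid UTF-8 is left alone.
$candidates = array_intersect(['UTF-8', 'ISO-8859-1'], mb_list_encodings());

// Strict detection rejects UTF-8 here and falls through to ISO-8859-1.
$encoding = mb_detect_encoding($source, $candidates, true);

// Convert only when a non-UTF-8 encoding was actually detected.
if ($encoding !== false && $encoding !== 'UTF-8') {
    $source = mb_convert_encoding($source, 'UTF-8', $encoding);
}

// 0xA0 in ISO-8859-1 becomes the two-byte UTF-8 sequence C2 A0, which json_encode() accepts.
$json = json_encode($source);

var_dump($encoding, bin2hex($source), $json);
```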

  • doubingling4706 2018-05-31 08:54

    You can filter out these unknown characters by using UTF-8//IGNORE as the target charset in your iconv() call.

    $this->encoding = mb_detect_encoding($source,mb_detect_order(),true);
    
    if ($this->encoding!="" && $this->encoding!="UTF8") {
        $source = iconv($this->encoding, "UTF-8//IGNORE", $source);
    } 
    

    By adding //IGNORE after the target charset, every character that cannot be represented in the target charset is silently discarded.
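
A related option, if dropping or replacing bad bytes at encode time is acceptable: since PHP 7.2, json_encode() accepts the JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE flags. This is a different technique from iconv() with //IGNORE, shown here only as a sketch on a hypothetical sample string.

```php
<?php
// Hypothetical sample with one byte (0xA0) that is invalid in UTF-8.
$value = "RESIDENCE\xA0PRINCIPE";

// Without flags, encoding fails and reports the error from the question.
$plain = json_encode($value);
$error = json_last_error_msg();

// Drop invalid bytes silently.
$ignored = json_encode($value, JSON_INVALID_UTF8_IGNORE);

// Replace each invalid byte with the Unicode replacement character U+FFFD.
$substituted = json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($plain, $error, $ignored, $substituted);
```

Neither flag recovers the original character, so converting the source data to UTF-8 first is still preferable when the real encoding can be identified.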
