dongyonglie5132
2011-07-17 11:36

How can I detect a malformed UTF-8 string in PHP?

Accepted

The iconv function sometimes gives me an error:

Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before passing the data to iconv?

4 answers

  • dongshi1880 10 years ago

    First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

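    You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. If an invalid string is given, the match fails and the return value carries no further detail: current PHP versions return false (older ones may return 0), and preg_last_error() reports PREG_BAD_UTF8_ERROR: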

    $isUTF8 = preg_match('//u', $string);
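
    As a minimal sketch of that check, the call can be wrapped in a small helper. The name is_valid_utf8 is hypothetical, and the strict comparison with 1 is a choice made here so that both false and 0 count as "invalid":

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function is_valid_utf8(string $s): bool
{
    // An empty pattern with the /u modifier matches any subject that passes
    // PCRE's UTF-8 check, so a result of 1 means the string is valid.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "é" encoded as C3 A9
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): lone Latin-1 byte E9
```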
    

    Another possibility is mb_check_encoding [PHP Manual]:

    $validUTF8 = mb_check_encoding($string, 'UTF-8');
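
    One way to apply this to the original question is to validate first and call iconv only for input that passes. The sketch below assumes the mbstring and iconv extensions are loaded; the function name and the ISO-8859-1 target are illustrative choices, not taken from the question:

```php
<?php
// Hypothetical wrapper: convert only when the input is valid UTF-8.
function utf8_to_latin1_or_null(string $s): ?string
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        return null; // malformed input never reaches iconv
    }
    $out = iconv('UTF-8', 'ISO-8859-1', $s);
    // iconv returns false when the conversion fails for another reason.
    return $out === false ? null : $out;
}

var_dump(utf8_to_latin1_or_null("caf\xC3\xA9") === "caf\xE9"); // bool(true)
var_dump(utf8_to_latin1_or_null("caf\xE9"));                   // NULL
```

    Valid UTF-8 that has no ISO-8859-1 equivalent can still make iconv fail, so the false check is kept as a second guard.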
    

    Another function you can use is mb_detect_encoding [PHP Manual]:

    $validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
    

    It's important to set the strict parameter to true.

    Additionally, iconv [PHP Manual] allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notice; this behavior cannot be changed.)

    echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
    echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
    

    You can suppress the notice with @ and compare the length of the returned string with the original; if //IGNORE dropped any bytes, the lengths differ:

    strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
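
    A slightly more defensive sketch of the same length comparison follows. The helper name is hypothetical, and the explicit false check is an assumption about the platform: some iconv builds may return false for malformed input even when //IGNORE is given.

```php
<?php
// Hypothetical helper: true when iconv dropped nothing from $s.
function survives_iconv_ignore(string $s): bool
{
    // @ hides the notice; //IGNORE asks iconv to drop bytes it cannot convert.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    // Treat a false result as "invalid" instead of measuring its length.
    return $clean !== false && strlen($clean) === strlen($s);
}

var_dump(survives_iconv_ignore("caf\xC3\xA9")); // bool(true)
var_dump(survives_iconv_ignore("caf\xE9"));     // bool(false)
```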
    

    Check the examples on the iconv manual page as well.

    You have not shared the source code where the notice is resulting from. You should add it if you want a more concrete suggestion.

  • dongshi1966 10 years ago

    Put an @ in front of iconv() to suppress the notice, and append //IGNORE to the destination encoding so that invalid characters are dropped:

    @iconv( 'UTF-8', $destinationEncoding . '//IGNORE', $yourString );
    
  • dtjo87679 10 years ago

    You could try using mb_detect_encoding to detect whether you've got a character set other than UTF-8, and then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
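
    A minimal sketch of that idea is shown below. It substitutes mb_check_encoding for mb_detect_encoding, because choosing between several candidate encodings is heuristic, and it simply assumes that anything that is not valid UTF-8 is in a fallback encoding. The function name and the ISO-8859-1 default are hypothetical:

```php
<?php
// Hypothetical helper: return $s as UTF-8, converting from $fallback if needed.
function to_utf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    // Assumption: non-UTF-8 input is in $fallback. This is a guess, not detection.
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

var_dump(to_utf8("caf\xE9") === "caf\xC3\xA9");     // bool(true): converted
var_dump(to_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // bool(true): unchanged
```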

  • dougaojue8185 10 years ago

    The specification of which characters are invalid in UTF-8 is pretty clear. You probably want to strip those out before trying to parse the data. They shouldn't be there, so if you can avoid them even before generating the XML, that would be even better.

    See here for a reference:

    http://www.w3.org/TR/xml/#charsets

    That isn't a complete list; many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.

    However, iconv might have built-in support for this:

    http://www.zeitoun.net/articles/clear-invalid-utf8/start
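
    For the XML case, one possible sketch removes every character outside the Char production of the XML 1.0 specification linked above. The helper name is hypothetical, and the input must already be valid UTF-8; otherwise preg_replace returns null:

```php
<?php
// Hypothetical helper: drop characters that XML 1.0 does not allow.
function strip_non_xml_chars(string $s): ?string
{
    // XML 1.0 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    );
}

// NUL and vertical tab are removed; the horizontal tab is kept.
var_dump(strip_non_xml_chars("a\x00b\x0Bc\td") === "abc\td"); // bool(true)
var_dump(strip_non_xml_chars("caf\xE9"));                     // NULL
```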
