dongzhi6905 2013-07-17 17:52

Decoding a UTF-8 string corrupts one string but not another

I'm having a very strange error.

I have verified that both my strings are UTF-8 (checked with mb_check_encoding and mb_detect_encoding), but when I use utf8_decode on one of them, it returns garbage characters. In that case I do not actually need utf8_decode; the string is fine without it.

The difficulty is that I have customers with UTF-8 databases that I pull strings from, and I use utf8_decode to ungarble the strings for PHP. If I don't, the space characters are replaced with à . The same code generates the string for every customer, but for some reason, when I generate it for this other customer, the strings come out all wrong.

Is there a way to verify that I need to use utf8_decode, other than the fact that the string is UTF-8?

Some Examples:

Using utf8_decode for customer 1:
?0,107�per�km
Without utf8_decode for customer 1:
€0,107 per km

Using utf8_decode for customer 2:
$7.00 per km
Without utf8_decode for customer 2:
$7.00 per km

Thanks guys!

1 answer

  • duanhong8839 2013-07-17 19:16

    mb_detect_encoding without an informed detect_order is no silver bullet, as this demonstrates:

    $ php -r 'echo mb_detect_encoding(iconv("utf-8","iso-8859-1","ë"));'
    UTF-8
    

    That is obviously wrong. Enabling strict mode helps a little:

    $ php -r 'var_dump(mb_detect_encoding(iconv("utf-8","iso-8859-1","ë"),mb_detect_order(),true));'
    bool(false)
    

    Why is it false? Well, let's examine the possible character sets mb_detect_encoding() uses in my configuration:

    $ php -r 'var_dump(mb_detect_order());'
    array(2) {
      [0] =>
      string(5) "ASCII"
      [1] =>
      string(5) "UTF-8"
    }
    

    So apart from ASCII and UTF-8, no other character set can be detected. Jon has a point though: you can store everything as UTF-8, and with the proper database settings, or even just a correct character_set_results on the MySQL connection (which I assume you use), you can retrieve it as UTF-8 regardless of how it is stored. However, if that is not an option for some reason I can't think of, it's up to you to specify which character sets are possible, either through mb_detect_order or through the second argument of mb_detect_encoding:

    $ php -r 'echo mb_detect_encoding(iconv("utf-8","iso-8859-1","ë"),"ASCII,UTF-8,ISO-8859-1,JIS", true);'
    ISO-8859-1
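
    The same idea can be sketched as a reusable function. This is only a minimal sketch: `to_utf8` is a hypothetical helper name, and the ISO-8859-1 fallback list is an assumption that you would replace with the encodings your data sources can actually produce. It checks for valid UTF-8 first and only then tries the candidates in strict mode:

```php
<?php
// Hypothetical helper (not part of PHP): normalise a string to UTF-8.
// $candidates is an assumption: list only encodings your sources can produce.
function to_utf8(string $s, array $candidates = ['ISO-8859-1']): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Strict detection, restricted to the caller's candidate list.
    $from = mb_detect_encoding($s, $candidates, true);
    if ($from === false) {
        throw new InvalidArgumentException('None of the candidate encodings matched.');
    }
    return mb_convert_encoding($s, 'UTF-8', $from);
}

// "\xEB" is "ë" in ISO-8859-1; "\u{00EB}" is the same character in UTF-8.
var_dump(to_utf8("\xEB") === "\u{00EB}");     // ISO-8859-1 input is converted
var_dump(to_utf8("\u{00EB}") === "\u{00EB}"); // UTF-8 input passes through
```

    Keep in mind that every byte sequence is valid ISO-8859-1, so a single-byte fallback will always "match". The candidate list narrows the guess; it does not prove the encoding.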
    

    In short: you are responsible for providing the list of possible character sets, and if you already have that kind of information, you can probably determine the actual character set (from connection settings, database/table settings, or even client configuration) rather than trying to detect it.
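
    Applied to the examples in the question, one way to check whether a conversion to ISO-8859-1 is safe is a round trip: convert, convert back, and compare. This is only a sketch. `survives_latin1` is a hypothetical helper, `mb_convert_encoding` stands in for `utf8_decode` (which is deprecated as of PHP 8.2), and the sample string assumes that customer 1's text contains a euro sign and non-breaking spaces, which the garbled output suggests but does not prove:

```php
<?php
// Make the replacement for unrepresentable characters explicit: "?" (0x3F).
mb_substitute_character(0x3F);

// Hypothetical helper: true if $utf8 converts to ISO-8859-1 without loss.
function survives_latin1(string $utf8): bool
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8;
}

// Assumed sample: euro sign (U+20AC) plus non-breaking spaces (U+00A0).
$price  = "\u{20AC}0,107\u{00A0}per\u{00A0}km";
$latin1 = mb_convert_encoding($price, 'ISO-8859-1', 'UTF-8');

var_dump(survives_latin1($price));             // euro sign has no ISO-8859-1 code point
var_dump(survives_latin1('$7.00 per km'));     // plain ASCII round-trips unchanged
var_dump(mb_check_encoding($latin1, 'UTF-8')); // the converted bytes are no longer valid UTF-8
```

    If the round trip changes the string, converting it to ISO-8859-1 loses information, and printing the converted bytes on a UTF-8 page would produce the kind of ? and � characters shown in the question.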

    This answer was selected by the asker as the best answer.
