doucheng7534 2013-01-28 14:03

Character encoding UTF-8 to Latin-1: explain these two characters

I have a database which uses latin-1 and a PHP application which is utf-8.

I have strings in the database like this:

'SociÃ©tÃ©' which should be 'Société'

'â<82>¬1bn' which should be '€1bn'.

When I fetch the data from the db and print the faulty characters to screen with PHP's ord(), it prints 195 and 226.

Could somebody explain why this is happening (why the data is saved like this and why the characters are read back as they are), and whether I can reverse it?


  • dpjuppr1361 2013-01-28 14:06

    The WHY:

    1) é is Unicode code point 233 (as the browser reads it).
    The UTF-8 bytes of é, read as Latin-1 characters, are Ã ©. This is why it appears like this in the database.
    Ã © starts with Ã, which is code point 195. Hence why you see that.

    2) € is Unicode code point 8364.
    The UTF-8 bytes of €, read as Latin-1 characters, are â <82> ¬. Again, this is why it appears like this in the db.
    â <82> ¬ starts with â, which is code point 226. Again, this is why you see this.

    That is why you see those values from ord() and why the characters are stored in that manner in a latin-1 database.
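
    The byte-level behaviour described above can be checked with a short sketch. It is written in Python rather than PHP only because the bytes are easier to inspect there; the variable names are made up for illustration.

```python
# UTF-8 bytes of the two characters from the question.
e_acute_bytes = "\u00e9".encode("utf-8")   # é -> b'\xc3\xa9'
euro_bytes = "\u20ac".encode("utf-8")      # € -> b'\xe2\x82\xac'

# Reading those bytes as Latin-1 yields one character per byte.
e_acute_misread = e_acute_bytes.decode("latin-1")  # 'Ã' followed by '©'
euro_misread = euro_bytes.decode("latin-1")        # 'â', control char U+0082, '¬'

# The first character of each mis-read string carries the value ord() reported.
first_codes = (ord(e_acute_misread[0]), ord(euro_misread[0]))
print(first_codes)  # (195, 226)
```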

    Reverse:

    To reverse it we would need to convert the Latin-1 characters back to UTF-8 bytes.

    If we try it:
    â is 226. Converting it from Latin-1 to UTF-8 produces Ã ¢.
    Ã is 195. Converting it from Latin-1 to UTF-8 produces Ã <83>.
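
    A small Python sketch of that attempt, assuming the stored value is the mis-read form of 'Société' (variable names are illustrative only). It shows that converting the stored Latin-1 characters to UTF-8 encodes each wrong character again rather than restoring the original text.

```python
# What ends up stored: the UTF-8 bytes of 'Société', read as Latin-1 characters.
stored = "Soci\u00e9t\u00e9".encode("utf-8").decode("latin-1")

# Attempted reversal: convert the stored Latin-1 characters to UTF-8.
converted = stored.encode("utf-8")

# The two lead characters on their own, converted the same way.
a_tilde_utf8 = "\u00c3".encode("utf-8")  # Ã (195) -> b'\xc3\x83'
a_circ_utf8 = "\u00e2".encode("utf-8")   # â (226) -> b'\xc3\xa2'

# Decoding the result as UTF-8 gives back the faulty text, not 'Société'.
still_faulty = converted.decode("utf-8") == stored
print(a_tilde_utf8, a_circ_utf8, still_faulty)
```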

    Problem:

    The problem is that Latin-1 has far fewer characters than UTF-8 can encode.
    Latin-1 is a single-byte encoding and UTF-8 is a multi-byte encoding, so 1 character in UTF-8 can turn into up to 4 characters in Latin-1.
    So reading UTF-8 bytes as Latin-1 produces faulty characters.
    Converting those faulty characters from Latin-1 to UTF-8 does not bring the originals back.
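
    To make the "up to 4" figure concrete, here is a minimal Python sketch that counts UTF-8 bytes for a few sample characters (the samples are my own choice, not from the question). Each UTF-8 byte becomes one character when read as Latin-1.

```python
# Sample characters: ASCII letter, é, €, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character takes in UTF-8.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]

# Number of characters seen when those bytes are read as Latin-1.
misread_lengths = [len(ch.encode("utf-8").decode("latin-1")) for ch in samples]

print(byte_counts)      # [1, 2, 3, 4]
print(misread_lengths)  # [1, 2, 3, 4]
```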

    Solution:

    If you are unable to change the character set of your database, I could suggest encoding special characters as HTML character entities before writing them (so the db can stay as Latin-1 and the app as UTF-8, since entities are plain ASCII that both can understand), e.g. Ä (A with umlaut) as &Auml;.
    It could be done using PHP's htmlentities() before writing and html_entity_decode() after reading, combined with mb_detect_encoding() to detect which strings need converting.
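
    The entity idea can be sketched in Python as follows. This is an illustration under assumptions, not the PHP implementation: it uses numeric character references (such as &#233;) instead of named entities (such as &Auml;), and the variable names are hypothetical.

```python
import html

# Text as the UTF-8 application sees it.
original = "Soci\u00e9t\u00e9 \u20ac1bn"

# Before writing: replace every non-ASCII character with a numeric reference.
stored = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(stored)  # Soci&#233;t&#233; &#8364;1bn

# The stored form is pure ASCII, so a Latin-1 column holds it unchanged.
latin1_bytes = stored.encode("latin-1")

# After reading: turn the references back into characters.
restored = html.unescape(latin1_bytes.decode("latin-1"))
print(restored == original)  # True
```

    Note that this sketch does not escape a literal & in the input, so text that already contains entity-like sequences would be altered on the way back.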

    References:

    See ltg.ed.ac.uk for the UTF-8 char bytes to Latin-1 bytes:
    http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%96&mode=char

    This answer was selected by the asker as the best answer.