dongqiu9018 2012-02-22 11:51

Smart quotes not correctly converted to UTF-8

I have a PHP script that imports and parses XML files and saves the data into the database:

  • Database collation: utf8_general_ci, charset: utf8
  • Page's charset : utf-8
  • XML files: ANSI, contains smart quotes (from MS Word)

So during import I do a utf8_encode() on the text from the XML files prior to saving into the database and subsequently displaying on the page.

But after the import succeeds and the data is saved to the DB:

  • Database: smart quotes are saved as a ? character (viewed from CMD)
  • Page: smart quotes are displayed as boxes

Any ideas as to why the smart quotes are not being converted correctly, even when using utf8_encode()?

EDIT:

@Tomalak: The XML files are actually .txt, no XML declaration (<?xml ... ?>), and no root element. My script actually adds a root element just so the parser works:

utf8_encode('<article>' . file_get_contents($xmlfile) . '</article>');

Seems like I need to add an XML declaration..? If so, what should it look like?

2 answers

  • dppx9253 2012-02-22 15:09

    If your XML string (i.e. file contents) is not encoded as UTF-8, you need an XML declaration that denotes the file encoding. If an XML declaration is missing, the parser will assume UTF-8.

    As long as you do not use "special" characters (i.e. anything outside of the ASCII range), it will work without a declaration even if your file is not really UTF-8-encoded. This is because UTF-8 is byte-compatible with ASCII. But as soon as the file contains characters from the upper half of a legacy code page (like the "smart quotes"), it will break, because UTF-8 represents those characters with different bytes.

    In your case there are text files in a legacy encoding that you wrap with a root element to turn them into well-formed XML. Therefore you need to add the XML declaration yourself:

    '<?xml version="1.0" encoding="Windows-1252"?><article>'.file_get_contents($xmlfile).'</article>'
    

    This way you instruct the DOMDocument how to interpret the bytes in your string. I assumed Windows-1252 for you because you said ANSI and mentioned the curly quotes.
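
A minimal sketch of that approach, under a few assumptions: the literal string below stands in for file_get_contents($xmlfile), the <title> element is made up for illustration, and the underlying libxml2 build can decode Windows-1252 (usually true when it is built with iconv support). Note that the declaration carries version="1.0", which the XML specification requires, and that utf8_encode() is no longer applied to the string.

```php
<?php
// Stand-in for file_get_contents($xmlfile): Windows-1252 bytes, where
// 0x93 and 0x94 are the opening and closing curly double quotes.
$contents = "<title>\x93Smart\x94 quotes</title>";

// Declare the real encoding of the bytes. Do not also run utf8_encode()
// over the string, or the declaration would no longer match the bytes.
$xml = '<?xml version="1.0" encoding="Windows-1252"?>'
     . '<article>' . $contents . '</article>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// DOMDocument hands strings back as UTF-8, whatever the source encoding was.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

// Print the bytes as hex so the result is visible on any terminal.
echo bin2hex($title), "\n";
```

The value in $title is already UTF-8, so it can be passed on to the database layer and the page without further conversion.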

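    In fact, 95% of the time this is what people really mean, even on Linux and even if they say ISO-8859-1 (or latin-1), which is almost, but not exactly, the same thing. The difference matters here: utf8_encode() assumes ISO-8859-1, where bytes 0x80 to 0x9F are control characters, whereas Windows-1252 puts printable characters such as the curly quotes in that range. That is why utf8_encode() does not produce proper quotes.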
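
To see the difference between the two encodings on the curly-quote bytes, here is a small sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() with ISO-8859-1 as the source encoding to mirror what utf8_encode() does; the sample string is invented for illustration.

```php
<?php
// Bytes 0x93 and 0x94 are the curly double quotes in Windows-1252.
$raw = "\x93Hello\x94";

// Reading the bytes as ISO-8859-1 (the assumption utf8_encode() makes)
// maps them to the C1 control characters U+0093 and U+0094.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Reading them as Windows-1252 yields the real quotes U+201C and U+201D.
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c29348656c6c6fc294
echo bin2hex($asCp1252), "\n"; // e2809c48656c6c6fe2809d
```

The first result is valid UTF-8 but contains invisible control characters, which is consistent with the boxes seen on the page.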

    To be extra sure you can open your text files in a hex editor, spot a few special characters and compare their byte values with the suspected encoding. In Windows-1252, the expected byte values for the curly double quotes are:

    • 147 (0x93) for the opening quote “
    • 148 (0x94) for the closing quote ”
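
If a hex editor is not at hand, the same check can be roughed out in PHP. This is only a sketch: countCurlyQuoteBytes() is a hypothetical helper, the sample string stands in for file_get_contents($xmlfile), and the mb_check_encoding() call assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: count the Windows-1252 curly-quote bytes in a raw string.
function countCurlyQuoteBytes(string $bytes): array
{
    return [
        '0x93' => substr_count($bytes, "\x93"),
        '0x94' => substr_count($bytes, "\x94"),
    ];
}

// Stand-in for file_get_contents($xmlfile).
$sample = "He said \x93hi\x94 and \x93bye\x94.";

$counts = countCurlyQuoteBytes($sample);
print_r($counts);

// Bytes 0x93 and 0x94 can also occur inside valid UTF-8 sequences, so
// confirm the data is not already UTF-8 before blaming Windows-1252.
$isUtf8 = mb_check_encoding($sample, 'UTF-8');
var_dump($isUtf8); // bool(false)
```

Non-zero counts combined with a failed UTF-8 check point towards a single-byte encoding such as Windows-1252, though they do not prove it.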

    Once the meaning of the individual bytes in your string is declared, DOMDocument can make sense of them and does the right thing.

    As for the ? characters in the DB, I strongly suspect there is some automatic encoding conversion going on. I admit that I don't know enough about PHP/MySQL/Unicode integration to say for sure.

    This answer was selected by the asker as the best answer.