Smart quotes not converted correctly to UTF-8

I have a PHP script that imports and parses XML files and saves the data into the database:

Database collation: utf8_general_ci, charset: utf8
Page's charset : utf-8
XML files: ANSI, contains smart quotes (from MS Word)

So during import I do a utf8_encode() on the text from the XML files prior to saving into the database and subsequently displaying on the page.

But after the data is successfully imported and saved into the DB:

Database: smart quotes are saved as ? character (viewed from CMD)
Page: smart quotes are displayed as boxes

Any ideas as to why the smart quotes are not being converted correctly, even when using utf8_encode()?

EDIT:

@Tomalak: The XML files are actually .txt, no XML declaration (<?xml ... ?>), and no root element. My script actually adds a root element just so the parser works:

utf8_encode('<article>' . file_get_contents($xmlfile) . '</article>');

Seems like I need to add an XML declaration..? If so, what should it look like?

dppx9253 2012-02-22 15:09 (accepted answer)
If your XML string (i.e. file contents) is not encoded as UTF-8, you need an XML declaration that denotes the file encoding. If an XML declaration is missing, the parser will assume UTF-8.

As long as you do not use "special" characters (i.e. anything outside of the ASCII range), it will work without a declaration even if your file is not really UTF-8-encoded. This is because UTF-8 is byte-compatible with ASCII. But as soon as the file contains characters from the upper half of a legacy code page, like the "smart quotes", it will break because these are represented by different bytes in UTF-8.
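To make the byte difference concrete, here is a minimal sketch, assuming PHP with the iconv extension available. It converts the single byte 0x93 to UTF-8 twice: once declared as Windows-1252, and once declared as ISO-8859-1, which is the source encoding that utf8_encode() assumes. The variable names are only illustrative.

```php
<?php
// Byte 0x93 is the left curly quote in Windows-1252.
$byte = "\x93";

// Interpreted as Windows-1252: becomes U+201C, three bytes in UTF-8.
$fromCp1252 = bin2hex(iconv('CP1252', 'UTF-8', $byte));

// Interpreted as ISO-8859-1: becomes the control character U+0093.
$fromLatin1 = bin2hex(iconv('ISO-8859-1', 'UTF-8', $byte));

echo $fromCp1252, "\n"; // e2809c
echo $fromLatin1, "\n"; // c293
```

The second result is valid UTF-8, but it encodes an invisible control character rather than a quote, which would be consistent with the boxes you see on the page.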

In your case there are text files in a legacy encoding that you wrap with a root element to turn them into well-formed XML. Therefore you need to add the XML declaration yourself:

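'<?xml version="1.0" encoding="Windows-1252"?><article>'.file_get_contents($xmlfile).'</article>'

Note that the version attribute is mandatory in an XML declaration, and that the utf8_encode() call has to go: the string must still contain the original Windows-1252 bytes when it reaches the parser.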

This way you instruct the DOMDocument how to interpret the bytes in your string. I assumed Windows-1252 for you because you said ANSI and mentioned the curly quotes.
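Put together, the idea can be sketched as follows. This assumes the DOM extension is enabled and that the underlying libxml2 build can decode Windows-1252 (typical when it is built with iconv support). The $raw string is a hypothetical stand-in for file_get_contents($xmlfile).

```php
<?php
// Hypothetical stand-in for the file contents: Windows-1252 bytes,
// with curly quotes stored as 0x93 and 0x94.
$raw = "<p>\x93Hello\x94</p>";

// Declare the real encoding and wrap the fragment in a root element.
$xml = '<?xml version="1.0" encoding="Windows-1252"?>'
     . '<article>' . $raw . '</article>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Strings read back from the DOM are UTF-8, whatever the input encoding was.
$text = $doc->documentElement->textContent;

echo bin2hex($text), "\n"; // e2809c48656c6c6fe2809d
```

No utf8_encode() is needed afterwards, because the text coming out of the DOM is already UTF-8.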

In fact, Windows-1252 is usually what people really mean by "ANSI", even on Linux and even if they say ISO-8859-1 (or latin-1), which is almost, but not exactly, the same thing: ISO-8859-1 has no curly quotes at 0x93 and 0x94.

To be extra sure you can open your text files in a hex editor, spot a few special characters and compare their byte values with the suspected encoding. For Windows-1252, the expected byte values for the curly quotes would be:

“ 147 (0x93)

” 148 (0x94)
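If you would rather not use a hex editor, the same check can be scripted. This is only a sketch: nonAsciiBytes() is a hypothetical helper name, and $sample stands in for the contents of one of your text files.

```php
<?php
// Count every byte outside the ASCII range, keyed by its hex value.
function nonAsciiBytes(string $bytes): array
{
    $counts = [];
    // count_chars() in mode 1 returns byte value => occurrences,
    // listing only the bytes that actually occur.
    foreach (count_chars($bytes, 1) as $byte => $n) {
        if ($byte >= 0x80) {
            $counts[sprintf('0x%02X', $byte)] = $n;
        }
    }
    return $counts;
}

// Hypothetical sample; in practice use file_get_contents($xmlfile).
$sample = "He said \x93hi\x94 and \x93bye\x94.";
print_r(nonAsciiBytes($sample));
```

If 0x93 and 0x94 show up in the output for your real files, that supports the Windows-1252 assumption.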

Once the meaning of the individual bytes in your string is declared, DOMDocument can make sense of them and does the right thing.

As for the ? characters in the DB, I strongly suspect there is some automagic encoding conversion going on. I admit that I don't know enough about PHP/MySQL/Unicode integration to say for sure.
Question: Smart quotes not converted correctly to UTF-8 (tags: mysql, php, xml), 2012-02-22 11:51
