PHP DOMDocument cannot handle UTF-8 characters (☆)

The webserver is serving responses with utf-8 encoding, all files are saved with utf-8 encoding, and every setting I know of has been set to utf-8.

Here's a quick program to test whether the output works:

<?php
$html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DomDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());

The output of the program is:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
</body></html>

Which renders as:

â˜† Hello â˜† World â˜†

What could I be doing wrong? How much more specific do I have to be to tell the DomDocument to handle utf-8 properly?

douyiavxxh02727 2012-07-03 11:47
DOMDocument::loadHTML() expects an HTML string.

HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default, per its specs. That has been the case for a long time, see 6.1. The HTML Document Character Set. In practice, common web browsers treat that default as Windows-1252.

I go back that far because PHP's DOMDocument is based on libxml, which brings in the HTMLparser, and that parser was designed for HTML 4.0.

I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.
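To make that assumption concrete, here is a minimal sketch of what happens to a single ☆ when its UTF-8 bytes are read one byte at a time. The variable names are only illustrative. The three byte values should match the entities in the question's output (&acirc; is character 226, followed by &#152; and &#134;), and decoding the same bytes as Windows-1252 should reproduce the rendering shown in the question.

```php
<?php
// "☆" (U+2606) written as explicit UTF-8 bytes, so the file's own encoding does not matter.
$star = "\xE2\x98\x86";

// Read the string byte by byte, the way a single-byte encoding such as ISO-8859-1 would.
$bytes = array_map('ord', str_split($star));
echo implode(' ', $bytes), "\n"; // 226 152 134

// Decode the same bytes as Windows-1252 and re-encode as UTF-8 so they can be displayed.
$mojibake = mb_convert_encoding($star, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n";
```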

Your string is UTF-8 encoded. Turn all characters above 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

Characters that have a named entity get the named entity, e.g. € -> &euro;

The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following code example makes the process a bit more visible by using a callback function:

$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s ", $utf8, $entity);
    return $entity;
}, $html);

For your string, this outputs:

☆ -> &#9734; ☆ -> &#9734; ☆ -> &#9734;

Anyway, that's just for looking deeper into your string. What you want is to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');

Take care that your input is actually UTF-8 encoded. If the input mixes encodings (that can happen with some inputs), keep in mind that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do more specific string replacements with the help of regular expressions, so I'll leave further details aside for now.
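As a rough end-to-end sketch of the entity approach: the helper below, to_numeric_entities(), is a hypothetical name and not part of PHP. It swaps in mb_ord() (PHP 7.2+) to build numeric entities, because the HTML-ENTITIES pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2. Unlike mb_convert_encoding, it never produces named entities. The sample string is a shortened stand-in for the document in the question.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a numeric (decimal) entity.
function to_numeric_entities(string $utf8): string
{
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        fn(array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';',
        $utf8
    );
    if ($out === null) {
        // With the /u modifier, preg_* returns null when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $out;
}

$html  = "<h1>\u{2606} Hello \u{2606} World \u{2606}</h1>";
$ascii = to_numeric_entities($html);
echo $ascii, "\n"; // <h1>&#9734; Hello &#9734; World &#9734;</h1>

// The string is now pure US-ASCII, so the parser's default encoding no longer matters.
$dom = new DOMDocument();
$dom->loadHTML($ascii);

// DOM methods and properties return UTF-8 strings.
$text = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $text, "\n";
```

Throwing on invalid input is one possible design choice here. It surfaces the mixed-encoding case mentioned above instead of silently passing a broken string on to loadHTML.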

The other alternative is to hint the encoding. This can be done in your case by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a webserver (e.g. saved on disk or held in a string, as in your example). The webserver normally sets that as the response header.

If you don't mind the warnings about the misplaced element, you can just add it in front of the string:

$dom = new DomDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document are automatically placed there. This is what happens here, too. The output (pretty-printed):

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
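A minimal sketch of the encoding-hint variant, with parser warnings collected rather than displayed. The document string is a shortened stand-in for the one in the question, and checking textContent is just one way to confirm the characters survived. Exact behavior may vary with the libxml version PHP is built against.

```php
<?php
$html = "<!doctype html><html><head><title>Test!</title></head>"
      . "<body><h1>\u{2606} Hello \u{2606} World \u{2606}</h1></body></html>";

// Collect libxml parse warnings (e.g. about the misplaced <meta>) instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The prepended Content-Type <meta> tells the HTML parser that the input is UTF-8.
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $html);
libxml_clear_errors();

$text = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $text, "\n";
```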
This answer was selected by the asker as the best answer.
