dongqiyou0303
2012-07-03 10:40

PHP DomDocument fails to handle UTF-8 characters (☆)

The web server serves responses as UTF-8, all files are saved as UTF-8, and every setting I know of has been set to UTF-8.

Here's a quick program to test whether the output works:

<?php
$html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DomDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());

The output of the program is:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
</body></html>

Which renders as:

â˜† Hello â˜† World â˜†


What could I be doing wrong? How much more specific do I have to be to tell the DomDocument to handle utf-8 properly?


3 answers

  • douyiavxxh02727 2012-07-03 11:47
    Accepted answer

    DOMDocument::loadHTML() expects an HTML string.

    Per its specs, HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default. This has been the case for a long time, see 6.1. The HTML Document Character Set. In practice, common web browsers treat that default as Windows-1252.

    I go back that far because PHP's DOMDocument is based on libxml, which brings along an HTML parser designed for HTML 4.0.

    I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.

    Your string is UTF-8 encoded. Turn all characters above 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, this is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

    • Characters that have a named entity get the named entity, e.g. € -> &euro;
    • All others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

    The following code example makes the process a bit more visible by using a callback function:

    $html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
        list($utf8) = $match;
        $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
        printf("%s -> %s\n", $utf8, $entity);
        return $entity;
    }, $html);
    

    For your string, this outputs:

    ☆ -> &#9734;
    ☆ -> &#9734;
    ☆ -> &#9734;
    

    Anyway, that's just for looking deeper into your string. What you want is to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

    $us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');
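
    As a side note, the HTML-ENTITIES pseudo-encoding of mbstring is deprecated as of PHP 8.2. A minimal sketch of the same idea using mb_encode_numericentity() follows; the helper name utf8ToNumericEntities is hypothetical, and unlike HTML-ENTITIES this variant always emits numeric entities, never named ones:

    ```php
    <?php
    // Hypothetical helper (not part of the original answer): turn every
    // character outside US-ASCII into a decimal numeric entity.
    function utf8ToNumericEntities(string $utf8): string
    {
        // One range per four entries: [first, last, offset, mask].
        // This covers all code points from U+0080 to U+10FFFF.
        $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

        return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
    }

    // ASCII markup stays untouched, only the stars are replaced.
    echo utf8ToNumericEntities('<h1>☆ Hello ☆ World ☆</h1>'), "\n";
    ```

    The result can be passed to loadHTML() in the same way as $us_ascii above.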
    

    Make sure that your input is actually UTF-8 encoded. mb_convert_encoding can only handle one encoding per string, so input with mixed encodings (which does happen with some sources) has to be dealt with separately. I already outlined above how to do more targeted string replacements with the help of regular expressions, so I'll leave further details aside for now.
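
    A minimal sketch of such a check, assuming the whole string is in a single encoding. The helper name ensureUtf8 and the ISO-8859-1 fallback are assumptions made for illustration; strings that mix encodings are not handled by it:

    ```php
    <?php
    // Hypothetical helper: return the input unchanged if it is valid UTF-8,
    // otherwise treat the whole string as $fallback and convert it.
    function ensureUtf8(string $input, string $fallback = 'ISO-8859-1'): string
    {
        if (mb_check_encoding($input, 'UTF-8')) {
            return $input;
        }

        // Assumption: invalid UTF-8 means the string is in the fallback encoding.
        return mb_convert_encoding($input, 'UTF-8', $fallback);
    }

    // "\xE9" is "é" in ISO-8859-1 and not valid UTF-8 on its own.
    echo bin2hex(ensureUtf8("\xE9")), "\n";
    ```

    Running the check first avoids double-encoding input that is already UTF-8.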

    The other alternative is to hint at the encoding. In your case this can be done by modifying the document and adding a

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    

    which is a Content-Type declaration specifying a charset. That is also best practice for HTML strings that are not delivered by a web server (e.g. saved on disk or held in a string, as in your example). The web server normally sets that as the response header.

    If you don't mind the warnings this triggers, you can just add it in front of the string:

    $dom = new DomDocument();
    $dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);
    

    Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document are automatically placed there. This is what happens here, too. The output (pretty-printed):

    <!DOCTYPE html>
    <html>
      <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8">
        <meta charset="utf-8">
        <title>Test!</title>
      </head>
      <body>
        <h1>☆ Hello ☆ World ☆</h1>    
      </body>
    </html>
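
    To check that the hint took effect without inspecting the serialized markup, one can read the text back from the DOM. This is only a sketch under the assumptions above: the helper name loadUtf8Html is hypothetical, and libxml_use_internal_errors() is used to collect the parser warnings instead of printing them:

    ```php
    <?php
    // Hypothetical helper: prepend the Content-Type hint and load the HTML.
    function loadUtf8Html(string $html): DOMDocument
    {
        $hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

        // Collect parser warnings internally, then restore the old setting.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($hint . $html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        return $dom;
    }

    $dom = loadUtf8Html('<h1>☆ Hello ☆ World ☆</h1>');
    // textContent is returned as UTF-8, so the stars should survive intact.
    echo $dom->getElementsByTagName('h1')->item(0)->textContent, "\n";
    ```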
    
  • douhao2026 2012-07-03 10:52
    <?php
      header("Content-type: text/html; charset=utf-8");
      $html = <<<HTML
    <!doctype html>
    <html>
    <head>
        <meta charset="utf-8">
        <title>Test!</title>
    </head>
    <body>
        <h1>☆ Hello ☆ World ☆</h1>
    </body>
    </html>
    HTML;
    
      $html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
      $dom = new DomDocument("1.0", "utf-8");
      $dom->loadHTML($html);
    
      header("Content-Type: text/html; charset=utf-8");
      echo($dom->saveHTML());
    

    Output:

    <!DOCTYPE html>
    <html><head><meta charset="utf-8"><title>Test!</title></head><body>
        <h1>&#9734; Hello &#9734; World &#9734;</h1>
    </body></html>
    
  • dongqian3750 2013-06-05 04:55

    There's a faster fix for that: prepend an XML encoding declaration when loading your HTML document into DOMDocument, then remove that processing instruction again and set (or, better said, reset) the original encoding. Here's some sample code:

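    $dom = new DOMDocument();
    // The XML declaration hints the input encoding to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Remove the processing instruction again; copy the node list first
    // so that removing a node does not disturb the iteration.
    foreach (iterator_to_array($dom->childNodes) as $item) {
        if ($item->nodeType == XML_PI_NODE) {
            $dom->removeChild($item);
        }
    }
    $dom->encoding = 'UTF-8'; // reset original encoding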
    
