dragon071111
2017-03-13 17:37

Loading UTF-8 HTML correctly with PHP DOMDocument and an HTML5 doctype

Accepted

I am using PHP's DOMDocument class with an HTML5 document, but when I do, some UTF-8 characters are "changed": the output contains entities such as &ensp;, &rsquo;, &eacute;, etc.

Here is my code.

    $parsedUrl = 'http://www.futursparents.com/';

    $curl = curl_init();
    @curl_setopt_array($curl, [
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_TIMEOUT => 60,
            CURLOPT_CONNECTTIMEOUT => 30,
            CURLOPT_FOLLOWLOCATION => TRUE,
            CURLOPT_MAXREDIRS => 5,
            CURLOPT_AUTOREFERER => FALSE,
            CURLOPT_HEADER => TRUE, // FALSE
            CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
            CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
            CURLOPT_CERTINFO => TRUE,
            CURLOPT_LOW_SPEED_LIMIT => 200,
            CURLOPT_LOW_SPEED_TIME => 50,
            CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
            CURLOPT_PROXYTYPE => CURLPROXY_HTTP,
            CURLOPT_ENCODING => 'gzip,deflate',
            CURLOPT_URL => $parsedUrl,
        ]);
    $response = curl_exec($curl);
    $info = curl_getinfo($curl);
    $error = curl_error($curl);
    $headers = trim(substr($response, 0, curl_getinfo($curl, CURLINFO_HEADER_SIZE)));
    $content = substr($response, curl_getinfo($curl, CURLINFO_HEADER_SIZE));

    curl_close($curl);

    libxml_use_internal_errors(true);

    $domDoc = new DOMDocument();
    $domDoc->loadHTML($content); // parse the fetched body
    print_r($domDoc->encoding); // It's OK => UTF-8
    // Got &ensp; or s&rsquo; or &eacute; etc.
    print_r($domDoc->saveHTML());

The cause seems to be the HTML5 doctype combined with a meta element like <meta charset="utf-8">.

If I prepend the HTML 4 style charset meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, the output seems to be OK.

    $domDoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content);
    // No &ensp; or s&rsquo; or &eacute; etc.
    print_r($domDoc->saveHTML());

Do you think this is the right solution?


1 answer

  • douxiuyu2028, 4 years ago

    I found why.

    The DOM extension is built on libxml2, whose HTML parser was written for HTML 4. If the document has an HTML5 doctype and declares its charset with a meta element like <meta charset="utf-8">, the parser does not recognize the declaration, so the HTML is interpreted as ISO-8859-something and non-ASCII characters are converted into HTML entities.

    However, the HTML 4 style declaration works: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    Reference: UTF-8 with PHP DOMDocument loadHTML?
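
The workaround described in this answer can be sketched as follows. This is only a minimal illustration, assuming the input really is UTF-8; `$content` is a made-up sample string standing in for the page fetched in the question, and the exact behavior may vary with the libxml2 version PHP was built against.

```php
<?php
// Hypothetical sample input: an HTML5 document that declares its charset
// only through <meta charset>, which libxml2's HTML 4 parser may not honor.
$content = '<!DOCTYPE html><html><head><meta charset="utf-8"><title>Test</title></head>'
         . '<body><p>Café s’il vous plaît</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$domDoc = new DOMDocument();

// Prepend the HTML 4 style declaration so the parser reads the bytes as UTF-8.
$domDoc->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content
);
libxml_clear_errors();

// Read the paragraph text back; it should come out as the original UTF-8 string.
$paragraph = $domDoc->getElementsByTagName('p')->item(0);
echo $paragraph->textContent, PHP_EOL;
```

Reading `textContent` rather than the serialized markup checks that the characters were decoded correctly at parse time, independently of how `saveHTML()` chooses to escape them on output.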
