dro62273 2013-12-31 09:55
Views: 192
Accepted

PHP Simple HTML DOM Parser returns gibberish

$html = file_get_html('http://www.livelifedrive.com/');  
echo $html->plaintext;

I've no problem scraping other websites but this particular one returns gibberish.
Is it encrypted or something?

4 answers

  • douwei1921 2013-12-31 10:38

    Actually, the gibberish you see is gzip-compressed content.

    When I fetch the page with hurl.it, for instance, these are the headers returned by the server:

    GET http://www.livelifedrive.com/malaysia/ (the URL http://www.livelifedrive.com/ resolves to http://www.livelifedrive.com/malaysia/)
    
    Connection: keep-alive
    Content-Encoding: gzip  <--- The content is gzipped
    Content-Length: 18202
    Content-Type: text/html; charset=UTF-8
    Date: Tue, 31 Dec 2013 10:35:42 GMT
    P3p: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
    Server: nginx/1.4.2
    Vary: Accept-Encoding,User-Agent
    X-Powered-By: PHP/5.2.17
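
    The same check can be done from PHP instead of hurl.it. The snippet below is a minimal sketch: `response_is_gzipped()` is a hypothetical helper name, not part of Simple HTML DOM, and it only assumes it is given raw header lines such as those PHP's HTTP stream wrapper places in `$http_response_header` after `file_get_contents()`.

```php
<?php
// Hypothetical helper: returns true when any raw response header line
// declares "Content-Encoding: gzip" (matched case-insensitively).
function response_is_gzipped(array $headerLines)
{
    foreach ($headerLines as $line) {
        if (preg_match('/^Content-Encoding:\s*(x-)?gzip\b/i', $line)) {
            return true;
        }
    }
    return false;
}

// Usage sketch (needs network access, so it is left commented out):
// $raw = file_get_contents('http://www.livelifedrive.com/');
// var_dump(response_is_gzipped($http_response_header));
```

    If this returns true for a page, the body has to be inflated before it is handed to the parser.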
    

    So once you have scraped the content, decompress it. Here is some sample code:

    if ( ! function_exists('gzdecode'))
    {
        /**
         * Decode gzip-encoded data.
         * Fallback for PHP versions before 5.4, which lack a built-in gzdecode().
         * 
         * http://php.net/manual/en/function.gzdecode.php
         * 
         * Alternative: http://digitalpbk.com/php/file_get_contents-garbled-gzip-encoding-website-scraping
         * 
         * @param string $data gzencoded data
         * @return string inflated data
         */
        function gzdecode($data) 
        {
            // Strip the 10-byte gzip header and the 8-byte footer (CRC32 + size),
            // then inflate. This assumes the header carries no optional fields.
            return gzinflate(substr($data, 10, -8));
        }
    }
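
    To tie this back to the question, here is one way the decoding step might be placed in front of the parser. It is only a sketch: `maybe_gunzip()` is a hypothetical helper name, it assumes `gzdecode()` is available (built in since PHP 5.4 with zlib, or via the fallback above), and it detects gzip by the two magic bytes at the start of the body rather than by the response headers.

```php
<?php
// Hypothetical helper: inflate the body only when it starts with the
// gzip magic bytes 0x1f 0x8b; otherwise return it unchanged.
function maybe_gunzip($body)
{
    if (strlen($body) >= 2 && substr($body, 0, 2) === "\x1f\x8b") {
        $decoded = gzdecode($body);
        // If decoding failed (false), fall back to the raw body.
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}

// Usage sketch, assuming simple_html_dom.php has been included
// (needs network access, so it is left commented out):
// $raw  = file_get_contents('http://www.livelifedrive.com/');
// $html = str_get_html(maybe_gunzip($raw));
// echo $html->plaintext;
```

    Fetching the raw string first and passing it to `str_get_html()` replaces the `file_get_html()` call from the question, which gives no chance to decompress the body before parsing.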
    

    This answer was selected as the best answer by the asker.