dro62273 2013-12-31 09:55
Views: 192
Answer accepted

PHP Simple HTML DOM Parser returns gibberish

$html = file_get_html('http://www.livelifedrive.com/');  
echo $html->plaintext;

I've no problem scraping other websites but this particular one returns gibberish.
Is it encrypted or something?


4 answers

  • douwei1921 2013-12-31 10:38

    Actually, the gibberish you see is gzip-compressed content.

    When I fetch the content with hurl.it, for instance, the server returns these headers:

    GET http://www.livelifedrive.com/malaysia/ (the URL http://www.livelifedrive.com/ resolves to http://www.livelifedrive.com/malaysia/)
    
    Connection: keep-alive
    Content-Encoding: gzip  <--- The content is gzipped
    Content-Length: 18202
    Content-Type: text/html; charset=UTF-8
    Date: Tue, 31 Dec 2013 10:35:42 GMT
    P3p: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
    Server: nginx/1.4.2
    Vary: Accept-Encoding,User-Agent
    X-Powered-By: PHP/5.2.17
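
    The same check can be done from PHP rather than by eye. The sketch below is only an illustration: `is_gzip_encoded()` is a hypothetical helper name, and it assumes you already have the response headers as an array of raw "Name: value" lines.

```php
<?php
/**
 * Hypothetical helper: report whether a list of raw response header lines
 * declares a gzip-compressed body via Content-Encoding.
 */
function is_gzip_encoded(array $headerLines)
{
    foreach ($headerLines as $line) {
        // Split "Name: value" once; lines without a colon (such as the
        // "HTTP/1.1 200 OK" status line) are skipped.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue;
        }
        // Header names are case-insensitive.
        if (strtolower(trim($parts[0])) !== 'content-encoding') {
            continue;
        }
        // The value may list several codings, e.g. "gzip, chunked".
        $codings = array_map('trim', explode(',', strtolower($parts[1])));
        if (in_array('gzip', $codings, true) || in_array('x-gzip', $codings, true)) {
            return true;
        }
    }
    return false;
}

// Header lines taken from the response shown above.
$headers = array(
    'HTTP/1.1 200 OK',
    'Content-Encoding: gzip',
    'Content-Type: text/html; charset=UTF-8',
);
var_dump(is_gzip_encoded($headers));
```

    If the page is fetched with file_get_contents() over HTTP, PHP traditionally exposes the header lines in the local variable $http_response_header, which could be passed to this helper; check the manual for your PHP version before relying on that.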
    

    So once you have scraped the content, decompress it. Here is some sample code:

    if ( ! function_exists('gzdecode'))
    {
        /**
         * Decode gz coded data
         * 
         * http://php.net/manual/en/function.gzdecode.php
         * 
         * Alternative: http://digitalpbk.com/php/file_get_contents-garbled-gzip-encoding-website-scraping
         * 
         * @param string $data gzencoded data
         * @return string inflated data
         */
        function gzdecode($data) 
        {
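            // Strip the 10-byte gzip header and the 8-byte trailer (CRC32 + size),
            // then inflate the raw DEFLATE stream. This assumes a minimal header
            // with no optional fields (e.g. an embedded file name).
            return gzinflate(substr($data, 10, -8));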
        }
    }
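
    To apply this only when it is needed, one option is to look at the first two bytes of the body before decoding. This is a minimal sketch, not part of Simple HTML DOM: `maybe_gunzip()` is a hypothetical helper name, and it assumes the zlib extension is loaded and gzdecode() is available (PHP 5.4+, or a fallback like the one above).

```php
<?php
/**
 * Hypothetical helper: return $body decompressed if it looks like a gzip
 * stream, otherwise return it unchanged.
 */
function maybe_gunzip($body)
{
    // Every gzip stream starts with the magic bytes 0x1f 0x8b.
    if (strlen($body) >= 2 && substr($body, 0, 2) === "\x1f\x8b") {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    // Not gzip (or decoding failed): hand back the original bytes.
    return $body;
}

// Round trip on locally generated data, so no network access is needed.
$raw = gzencode('<html><body>Hello</body></html>');
echo maybe_gunzip($raw), "\n";
```

    The decoded string can then be handed to the parser, for example with Simple HTML DOM's str_get_html() instead of file_get_html(), since the latter fetches and parses in one step and gives no chance to decompress in between.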
    

