dongxun1978 2013-05-27 04:53
66 views
Answer accepted

Getting content from a URL faster with PHP

I am using PHP and want to fetch the content of a URL as fast as possible.
Here is the code I use.
Code (1):

<?php
    $content = file_get_contents('http://www.filehippo.com');
    echo $content;
?>

There are many other ways to read files, such as fopen() and readfile(), but I think file_get_contents() is faster than these methods.
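The speed claim can be checked rather than assumed. The following is a minimal timing sketch, not a rigorous benchmark: `time_reader()` and the `$readers` array are hypothetical names, and a local temporary file stands in for the remote page so the script runs without network access. For a real URL, network latency most likely dominates, so any difference between these functions is probably negligible.

```php
<?php
// Hypothetical helper: total time in seconds for $runs calls of $reader on $path.
function time_reader(callable $reader, string $path, int $runs = 200): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $reader($path);
    }
    return microtime(true) - $start;
}

// Three ways to read a whole file into a string.
$readers = [
    'file_get_contents' => function (string $path): string {
        return file_get_contents($path);
    },
    'fopen + stream_get_contents' => function (string $path): string {
        $handle = fopen($path, 'rb');
        $data = stream_get_contents($handle);
        fclose($handle);
        return $data;
    },
    'readfile (buffered)' => function (string $path): string {
        ob_start();              // readfile() writes to output, so capture it
        readfile($path);
        return ob_get_clean();
    },
];

// Assumption: a local temporary file is used instead of the remote page.
$path = tempnam(sys_get_temp_dir(), 'bench');
file_put_contents($path, str_repeat('<p>sample</p>', 10000));

foreach ($readers as $name => $reader) {
    printf("%-28s %.4f s\n", $name, time_reader($reader, $path));
}

unlink($path);
```

The absolute numbers depend on the machine and PHP version, so only the relative order is meaningful.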

When you run the code above, you will see that it outputs everything from the website, including images and ads. I want only the plain HTML text, without CSS styles, images, or ads. How can I get that?
The following shows what I mean.
Code (2):

<?php
    $content = file_get_contents('http://www.filehippo.com');
    // do something to remove css-style, images and ads.
    // return the plain html text in $mod_content.
    echo $mod_content;
?>

If I do it as above, I am going about it the wrong way, because I have already downloaded the full content into $content and only then modify it.
Is there a function, method, or anything else that fetches the plain HTML text directly from the URL?
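A point that bears on this question: file_get_contents() downloads only the HTML document at the given URL. The images and style sheets that the markup refers to are requested later, by the browser that renders the echoed output. As far as I know, HTTP has no "plain HTML only" mode, so any stripping has to happen after the download. What can be tuned is the request itself. The sketch below is one way to do that, assuming a hypothetical helper named `fetch_html()` and arbitrary timeout and size limits.

```php
<?php
// Hypothetical helper: read at most $maxBytes from $url and stop waiting
// after $timeout seconds. Returns the data as a string, or false on failure.
function fetch_html(string $url, float $timeout = 5.0, int $maxBytes = 200000)
{
    $context = stream_context_create([
        'http' => [
            'timeout'         => $timeout, // read timeout in seconds
            'follow_location' => 1,        // follow redirects
            'max_redirects'   => 3,
        ],
    ]);

    // Arguments: use_include_path = false, context, offset = 0, length cap.
    return file_get_contents($url, false, $context, 0, $maxBytes);
}

// Example call (commented out so the sketch runs without network access):
// $html = fetch_html('http://www.filehippo.com');
// echo $html === false ? "Request failed\n" : strlen($html) . " bytes\n";
```

The length cap truncates the document, so it only helps when the part of the page you need is near the top.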

The code below is only for illustration; it is not real PHP code.
Ideal code (3):

<?php
    $plain_content = get_plain_html('http://www.filehippo.com');
    echo $plain_content; // no css-style, images and ads.
?>

If such a function exists, it would be much faster than the alternatives. Is this possible?
Thanks.

2 answers

  • duanqia9034 2013-05-27 05:34

    Try this.

    // Wrapper class so that $this is defined; the class name is illustrative.
    class HtmlSimplifier
    {
        private $html;

        public function __construct($html)
        {
            $this->html = $html;
        }

        // Run every clean-up step and return the simplified markup.
        public function process()
        {
            // header
            $this->_replace('/.*<head>/ism', "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE html PUBLIC '-//WAPFORUM//DTD XHTML Mobile 1.0//EN' 'http://www.wapforum.org/DTD/xhtml-mobile10.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head>");

            // title: keep only the <title> element inside <head>
            $this->_replace('/<head>.*?(<title>.*<\/title>).*?<\/head>/ism', '<head>$1</head>');

            // strip out divs with little content
            // (_stripContentlessDivs() is not shown in this snippet, so the call is disabled)
            // $this->_stripContentlessDivs();

            // divs/p
            $this->_replace('/<div[^>]*>/ism', '');
            $this->_replace('/<\/div>/ism', '<br/><br/>');
            $this->_replace('/<p[^>]*>/ism', '');
            $this->_replace('/<\/p>/ism', '<br/>');

            // h tags (h1 to h6)
            $this->_replace('/<h[1-6][^>]*>(.*?)<\/h[1-6]>/ism', '<br/><b>$1</b><br/><br/>');

            // remove align/height/width/style/rel/id/class attributes
            $this->_replace('/\salign=(\'?\"?).*?\\1/ism', '');
            $this->_replace('/\sheight=(\'?\"?).*?\\1/ism', '');
            $this->_replace('/\swidth=(\'?\"?).*?\\1/ism', '');
            $this->_replace('/\sstyle=(\'?\"?).*?\\1/ism', '');
            $this->_replace('/\srel=(\'?\"?).*?\\1/ism', '');
            $this->_replace('/\sid=(\'?\"?).*?\\1/ism', '');
            $this->_replace('/\sclass=(\'?\"?).*?\\1/ism', '');

            // remove comments
            $this->_replace('/<\!--.*?-->/ism', '');

            // remove script/style
            $this->_replace('/<script[^>]*>.*?\/script>/ism', '');
            $this->_replace('/<style[^>]*>.*?\/style>/ism', '');

            // multiple \n
            $this->_replace('/\n{2,}/ism', '');

            // remove multiple <br/>
            $this->_replace('/(<br\s?\/?>){2}/ism', '<br/>');
            $this->_replace('/(<br\s?\/?>\s*){3,}/ism', '<br/><br/>');

            // tables
            $this->_replace('/<table[^>]*>/ism', '');
            $this->_replace('/<\/table>/ism', '<br/>');
            $this->_replace('/<(tr|td|th)[^>]*>/ism', '');
            $this->_replace('/<\/(tr|td|th)[^>]*>/ism', '<br/>');

            return $this->html;
        }

        // Apply one regular-expression replacement to the stored HTML.
        private function _replace($pattern, $replacement, $limit = -1)
        {
            $this->html = preg_replace($pattern, $replacement, $this->html, $limit);
        }
    }

    $content = file_get_contents('http://www.filehippo.com');
    $simplifier = new HtmlSimplifier($content);
    echo $simplifier->process();

    

    For more, see https://code.google.com/p/phpmobilizer/
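Regular expressions like the ones above can misfire on nested or malformed markup. An alternative, which is not what the answer uses, is to parse the page with PHP's DOM extension and remove the unwanted nodes. The sketch below is a minimal version under a few assumptions: `strip_presentation()` is a hypothetical name, the DOM extension is enabled, and the lists of elements and attributes to drop are only a starting point. It does not detect ads, which have no standard markup.

```php
<?php
// Hypothetical helper: remove scripts, styles, images, comments and
// presentational attributes from a non-empty HTML string.
function strip_presentation(string $html): string
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Drop whole nodes that carry styling, scripts or media.
    $nodes = $xpath->query('//script | //style | //link | //img | //iframe | //comment()');
    foreach (iterator_to_array($nodes) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Drop presentational attributes from the remaining elements.
    $attributes = $xpath->query('//@style | //@class | //@id | //@align | //@width | //@height');
    foreach (iterator_to_array($attributes) as $attribute) {
        $attribute->ownerElement->removeAttributeNode($attribute);
    }

    return $dom->saveHTML();
}

// Example call (commented out so the sketch runs without network access):
// echo strip_presentation(file_get_contents('http://www.filehippo.com'));
```

The page still has to be downloaded in full before it can be cleaned, so this reduces what is sent to the visitor, not the fetch time.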

    This answer was selected as the best answer by the asker.