dsdfd2322 2011-06-11 08:28
Viewed 54 times

How do I get a list of the links in a web page using PHP? [duplicate]

Possible Duplicate:
Parse Website for URLs

How do I get all the links in a webpage using PHP?

I need to get a list of the links. For example, given:

    <a href="http://www.google.com">Google</a>

I want to fetch the href (http://www.google.com) and the text (Google).

The situation is:

I'm building a crawler, and I want it to get all the links into a database table.


1 Answer

  • doutale7115 2011-06-11 09:11

    There are a couple of ways to do this, but the way I would approach it is something like the following.

    Use cURL to fetch the page, e.g.:

    // $target_url holds the URL to be fetched, e.g. "http://www.website.com"
    // $userAgent should be set to a friendly agent; sneaky, but hey...
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number: " . curl_errno($ch);
        echo "<br />cURL error: " . curl_error($ch);
        exit;
    }
    

    If all goes well, page content is now all in $html.

    Let's move on and load the page in a DOM Object:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
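    As a side note, rather than silencing parser warnings with the @ operator, you can ask libxml to collect them quietly. This is a stylistic alternative, not something the approach requires; $html below is an inline stand-in for the fetched page:

```php
// Minimal sketch: same parse as above, but using libxml's error
// collection instead of the @ error-suppression operator.
$html = '<p>Unclosed paragraph<br><a href="http://example.com">Example</a>';

libxml_use_internal_errors(true);   // collect parser warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();              // discard the collected warnings

echo $dom->getElementsByTagName('a')->length; // 1 anchor found
```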

    So far so good, XPath to the rescue to scrape the links out of the DOM object:

    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    

    Loop through the result and get the links:

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $link = $href->getAttribute('href');
        $text = $href->nodeValue;
    
        // Do what you want with the link; print it out:
        echo $text, ' -> ', $link;
    
        // Or save it in an array for later processing:
        $links[$i]['href'] = $link;
        $links[$i]['text'] = $text;
    }
    

    $hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.

    This should pretty much do it for you. The only part I am not 100% sure of is what happens when the anchor wraps an image rather than text; in that case nodeValue will likely be empty, so you would need to test for and filter out those cases.
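    For the image-anchor case above, a rough filter might look like this. The test markup and the rule (skip anchors with no href or no visible text) are my own assumptions, not part of the original answer:

```php
// Sketch: keep only anchors that have an href and non-empty link text,
// which drops image-only anchors (their nodeValue is empty).
$html = '<a href="http://a.example">Text link</a>'
      . '<a href="http://b.example"><img src="logo.png"></a>'
      . '<a>No href</a>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->evaluate('//a') as $a) {
    $href = $a->getAttribute('href');
    $text = trim($a->nodeValue);
    if ($href === '' || $text === '') {
        continue; // image-only anchor, or anchor without an href
    }
    $links[] = ['href' => $href, 'text' => $text];
}

print_r($links); // only the "Text link" anchor survives
```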

    Hope this gives you an idea of how to scrape links, happy coding.
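    Since the question mentions a database table, here is one way the collected $links array could be persisted. The table name and schema are made up for illustration, and an in-memory SQLite database stands in for whatever database the crawler actually uses:

```php
// Sketch: insert scraped links into a hypothetical `links` table
// via PDO, using in-memory SQLite as a stand-in backend.
$links = [
    ['href' => 'http://www.google.com', 'text' => 'Google'],
];

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (href TEXT, text TEXT)');

// Prepared statement, so the scraped values are bound safely.
$stmt = $pdo->prepare('INSERT INTO links (href, text) VALUES (:href, :text)');
foreach ($links as $link) {
    $stmt->execute([':href' => $link['href'], ':text' => $link['text']]);
}

echo $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn(); // 1
```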

