dsdfd2322 2011-06-11 08:28
Viewed 54 times

How do I get a list of the links in a web page using PHP? [duplicate]

Possible Duplicate:
Parse Website for URLs

How do I get all the links in a webpage using PHP?

I need to get a list of the links. For example, given:

    <a href="http://www.google.com">Google</a>

I want to fetch the href (http://www.google.com) and the text (Google).

The situation is:

I'm building a crawler, and I want it to get all the links into a database table.


1 Answer

  • doutale7115 2011-06-11 09:11

    There are a couple of ways to do this, but the way I would approach it is something like the following.

    Use cURL to fetch the page, e.g.:

    // $target_url holds the URL to be fetched, e.g. "http://www.website.com"
    // $userAgent should be set to a friendly agent; sneaky, but hey...
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number: " . curl_errno($ch);
        echo "<br />cURL error: " . curl_error($ch);
        exit;
    }
    

    If all goes well, page content is now all in $html.

    Let's move on and load the page in a DOM Object:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
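    As a side note, rather than silencing parser warnings with the @ operator, you can ask libxml to collect them quietly. This is a stylistic alternative, not something the approach requires; $html below is an inline stand-in for the fetched page:

```php
// Minimal sketch: same parse as above, but using libxml's error
// collection instead of the @ error-suppression operator.
$html = '<p>Unclosed paragraph<br><a href="http://example.com">Example</a>';

libxml_use_internal_errors(true);   // collect parser warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();              // discard the collected warnings

echo $dom->getElementsByTagName('a')->length; // 1 anchor found
```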

    So far so good, XPath to the rescue to scrape the links out of the DOM object:

    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    

    Loop through the result and get the links:

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $link = $href->getAttribute('href');
        $text = $href->nodeValue;
    
        // Do what you want with the link; print it out:
        echo $text, ' -> ', $link;
    
        // Or save it in an array for later processing:
        $links[$i]['href'] = $link;
        $links[$i]['text'] = $text;
    }
    

    $hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.

    This should pretty much do it for you. The only part I am not 100% sure of is what happens when the anchor wraps an image rather than text; in that case nodeValue will likely be empty, so you would need to test for and filter out those cases.
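    For the image-anchor case above, a rough filter might look like this. The test markup and the rule (skip anchors with no href or no visible text) are my own assumptions, not part of the original answer:

```php
// Sketch: keep only anchors that have an href and non-empty link text,
// which drops image-only anchors (their nodeValue is empty).
$html = '<a href="http://a.example">Text link</a>'
      . '<a href="http://b.example"><img src="logo.png"></a>'
      . '<a>No href</a>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->evaluate('//a') as $a) {
    $href = $a->getAttribute('href');
    $text = trim($a->nodeValue);
    if ($href === '' || $text === '') {
        continue; // image-only anchor, or anchor without an href
    }
    $links[] = ['href' => $href, 'text' => $text];
}

print_r($links); // only the "Text link" anchor survives
```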

    Hope this gives you an idea of how to scrape links, happy coding.
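    Since the question mentions a database table, here is one way the collected $links array could be persisted. The table name and schema are made up for illustration, and an in-memory SQLite database stands in for whatever database the crawler actually uses:

```php
// Sketch: insert scraped links into a hypothetical `links` table
// via PDO, using in-memory SQLite as a stand-in backend.
$links = [
    ['href' => 'http://www.google.com', 'text' => 'Google'],
];

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (href TEXT, text TEXT)');

// Prepared statement, so the scraped values are bound safely.
$stmt = $pdo->prepare('INSERT INTO links (href, text) VALUES (:href, :text)');
foreach ($links as $link) {
    $stmt->execute([':href' => $link['href'], ':text' => $link['text']]);
}

echo $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn(); // 1
```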

