I want to create a PHP function that fetches a website's homepage, finds all the links on it, follows each of those links, and keeps going until every link on the site has been visited. I really need to build something like this so I can spider my network of sites and provide a "one-stop" search across them.
Here's what I've got so far:
function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if (empty($current_array)) {
        // Make the request to the original URL
        $session = curl_init($urltospider);
        curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($session);
        curl_close($session);
        if ($html != '') {
            $dom = new DOMDocument();
            @$dom->loadHTML($html);
            $xpath = new DOMXPath($dom);
            $hrefs = $xpath->evaluate("/html/body//a");
            for ($i = 0; $i < $hrefs->length; $i++) {
                $href = $hrefs->item($i);
                $url  = $href->getAttribute('href');
                if (!in_array($url, $ignore_array) && !in_array($url, $current_array)) {
                    // Add this URL to the current spider array
                    $current_array[] = $url;
                }
            }
        } else {
            die('Failed connection to the URL');
        }
    } else {
        // There are already URLs in the current array
        foreach ($current_array as $url) {
            // Connect to this URL
            // Find all the links in this URL
            // Go through each URL and get more links
        }
    }
}
The problem is that I can't get my head around how to proceed from here. Can anyone help me out? Basically, the function needs to keep repeating itself until every link on the site has been found.
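For what it's worth, here's a rough sketch of the structure I have in mind, with the recursion replaced by a queue of pending URLs. fetch_html and extract_links are just my own helper names (not built-ins), and I know I'd still have to resolve relative hrefs against the base URL and restrict the crawl to my own domains:

function fetch_html($url) {
    // Same cURL request as above, wrapped in a helper
    $session = curl_init($url);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($session);
    curl_close($session);
    return ($html === false) ? '' : $html;
}

function extract_links($html) {
    // Pull every href out of the document body
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query('/html/body//a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

function spider($urltospider, $ignore_array = array()) {
    $pending = array($urltospider); // URLs waiting to be fetched
    $visited = array();             // URLs already fetched

    while (!empty($pending)) {
        $url = array_shift($pending);
        if (in_array($url, $visited) || in_array($url, $ignore_array)) {
            continue;
        }
        $visited[] = $url;

        $html = fetch_html($url);
        if ($html == '') {
            continue; // skip a failed URL instead of die()ing mid-crawl
        }

        // Queue every link we haven't seen or queued yet
        foreach (extract_links($html) as $link) {
            if (!in_array($link, $visited) && !in_array($link, $pending)) {
                $pending[] = $link;
            }
        }
    }

    return $visited; // every URL reachable from the starting page
}

Is a loop like this the right way to go, or is there a cleaner recursive pattern I'm missing?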