I am trying to scrape a website's pages to get certain text content. New pages are always being added, so I want to be able to just increment through each page (using a fixed-format URL) until I get a 404.
Pages are in this format:
http://thesite.com/page-1.html
http://thesite.com/page-2.html
http://thesite.com/page-3.html
...etc....
Everything runs smoothly until it hits the 36th page, then the script just dies (it doesn't even hit the 404 test case). I know there are roughly 100 pages in this example, I can view them all manually without a problem, and there is nothing wrong with the 36th page itself.
Test case: I tried looping through http://google.com 50 times and the cURL recursion had no problem, so it seems to be either the site I actually want to cURL or something on my server.
It looks like some sort of limit on either the remote server or my own, because I can run this page over and over with no delay between runs and it always reads exactly 36 pages before dying.
Can remote servers set a limit on cURL requests? Are there any other timeouts I need to increase? Could it be a server memory issue?
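To help narrow it down, I was thinking of dumping cURL's own error state and the script's memory usage after each request, so I can tell a remote-side refusal apart from a local memory or timeout problem. A minimal sketch (dumpRequestState is just a made-up helper name; the functions it calls are standard PHP):

//Hypothetical helper: dump cURL's error state and PHP memory usage
//after a request, to separate remote refusals from local resource issues.
function dumpRequestState($curl, $currentPage){
    echo "Page ".$currentPage
        ." - HTTP ".curl_getinfo($curl, CURLINFO_HTTP_CODE)
        ." - cURL errno ".curl_errno($curl)." (".curl_error($curl).")"
        ." - memory ".memory_get_usage(true)." bytes<br>";
}

If the memory number climbs steadily towards memory_limit in php.ini, that would point at my server rather than the remote one.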
**Recursive scraping function** (the $curl handle is created in the first call and then passed by reference; I read this is better than creating and closing large numbers of cURL handles). str_get_html() comes from the Simple HTML DOM parser; a sketch of how the function is first called is included after the listing.
function scrapeSite(&$curl, $preURL, $postURL, $parameters, $currentPage){
    //Format the URL for the current page
    $formattedURL = $preURL.$currentPage.$postURL;
    echo "Formatted URL: ".$formattedURL."<br>";
    echo "Count: ".$currentPage."<br>";

    //Point the shared cURL handle at the current page
    curl_setopt($curl, CURLOPT_URL, $formattedURL);

    //Remove the PHP execution time limit
    set_time_limit(0);

    //Check for a 404, or stop once page 50 is reached
    $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if($httpCode == 404 || $currentPage == 50) {
        curl_close($curl);
        return 'PAGE NOT FOUND<br>';
    }

    //Set the other cURL options
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 400); //timeout in seconds

    //Fetch the page and parse it with Simple HTML DOM
    $content = curl_exec($curl);
    $html = str_get_html($content);

    echo "Parameter Check: ".is_array($html->find($parameters))."<br>";
    if(is_array($html->find($parameters)) > 0){
        foreach($html->find($parameters) as $element) {
            echo "Text: ".$element->plaintext."<br>";
        }
        //Recurse into the next page
        return scrapeSite($curl, $preURL, $postURL, $parameters, $currentPage + 1);
    } else {
        echo "No Elements Found";
    }
}
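For reference, the first call looks roughly like this (the "div.content" selector is just a placeholder; the URL pieces match the page format above):

//Rough sketch of the initial call: the cURL handle is created once here
//and then passed by reference through every recursive call.
$curl = curl_init();
echo scrapeSite($curl, "http://thesite.com/page-", ".html", "div.content", 1);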