I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:
Recv failure: Connection was reset
Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:
$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');
//Some code that creates my SQL table.
//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){
//3 array variables defined here, which I didn't include in this example.
$path = $uniqueItems->href;
$url = 'http://IPAddressOfSiteIAmScraping' . $path;
//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
echo 'Scraping error: ' . curl_error($curl);
exit; }
curl_close($curl);
//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Alows the DOM to display html entities for special characters like รถ.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.
//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]');
$league = $xpath->query('//td[@class="left-column"]/p[1]');
//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
$temp = new DOMDocument();
$temp->appendChild($temp->importNode($e,TRUE));
$val = $temp->saveHTML();
$val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
$val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
$val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
$final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
$item['title'] = $final; //Here's the item name, ready for the SQL table.
}
//Here's a bunch of code where I write to my SQL table. Again, this part works great!
}
I am not opposed to switching to regex if I need to ditch DOM, but I did three days worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I'm seeing says "Recv failure: Connection was reset by peer", which is not what I am getting. I'm really frustrated that I have to ask for help - I've been doing so great so far - just learning as I go. This is the first thing I've ever written in PHP.
TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".
Hopefully someone can help me!! :) Thanks for reading even if you can't!
P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.