PHP cURL web-scraper间歇性地返回错误“Recv failure:Connection was reset”

I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:

Recv failure: Connection was reset

Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:

$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');

//Some code that creates my SQL table.

//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){

   //3 array variables defined here, which I didn't include in this example.

   $path = $uniqueItems->href;
   $url = 'http://IPAddressOfSiteIAmScraping' . $path;

//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);

//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl);
    exit; }
curl_close($curl);

//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Alows the DOM to display html entities for special characters like รถ.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.

//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]'); 
$league = $xpath->query('//td[@class="left-column"]/p[1]');

//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
    $temp = new DOMDocument();
    $temp->appendChild($temp->importNode($e,TRUE));
    $val = $temp->saveHTML();
    $val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
    $val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
    $val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
    $final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
    $item['title'] = $final; //Here's the item name, ready for the SQL table.
}

//Here's a bunch of code where I write to my SQL table. Again, this part works great!

}

I am not opposed to switching to regex if I need to ditch DOM, but I did three days worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I'm seeing says "Recv failure: Connection was reset by peer", which is not what I am getting. I'm really frustrated that I have to ask for help - I've been doing so great so far - just learning as I go. This is the first thing I've ever written in PHP.

TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".

Hopefully someone can help me!! :) Thanks for reading even if you can't!

P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.

duangan9251
duangan9251 谢谢Chay22!在发布之前我确实看到了这一点,但我并不是100%确定我的问题是否相同。我的错误是缺少“bypeer”部分。你认为假设它被同伴重置是安全的吗?
4 年多之前 回复
dongshang6062
dongshang6062 可能发现有用的stackoverflow.com/questions/10285700/...
4 年多之前 回复

1个回答

After researching this at length (I'd already been researching it for days before posting my question), I've caved in and accepted that this error is probably tied to the site I'm trying to scrape and therefore out of my control.

I did manage to work around it though, so I'll drop my workaround here...

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl) . '</br>';
    echo 'Dropping table...</br>';
    $sql = "DROP TABLE table_item_info";
        if (!mysqli_query($conn, $sql)) {
            echo "Could not drop table: " . mysqli_error($conn);
        }
    mysqli_close($conn);
    echo "TABLE has been dropped. Restarting.</br>";
    goto start;
    exit; }
curl_close($curl);

Basically, what I've done is implemented error-checking. If the error comes up under curl_errno($curl), I assume it's the connection reset error. That being the case, I drop my SQL table and then jump back to the start of my script using "goto start". Then, at the top of my file I have "start:"

This fixed my problem! Now I don't need to worry about whether the connection was reset or not. My code is smart enough to determine that on its own and reset the script if that was the case.

Hope this helps!

Csdn user default icon
上传中...
上传图片
插入图片
抄袭、复制答案,以达到刷声望分或其他目的的行为,在CSDN问答是严格禁止的,一经发现立刻封号。是时候展现真正的技术了!
立即提问
相关内容推荐