duancheng3042 2016-05-06 22:45
Views: 450

PHP cURL web scraper intermittently returns the error "Recv failure: Connection was reset"

I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:

Recv failure: Connection was reset

Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:

$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');

//Some code that creates my SQL table.

//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){

   //3 array variables defined here, which I didn't include in this example.

   $path = $uniqueItems->href;
   $url = 'http://IPAddressOfSiteIAmScraping' . $path;

//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);

//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl);
    exit; }
curl_close($curl);

//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Allows the DOM to display HTML entities for special characters like ö.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.

//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]'); 
$league = $xpath->query('//td[@class="left-column"]/p[1]');

//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
    $temp = new DOMDocument();
    $temp->appendChild($temp->importNode($e,TRUE));
    $val = $temp->saveHTML();
    $val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
    $val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
    $val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
    $final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
    $item['title'] = $final; //Here's the item name, ready for the SQL table.
}

//Here's a bunch of code where I write to my SQL table. Again, this part works great!

}

I am not opposed to switching to regex if I need to ditch DOM, but I did three days' worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I find is about "Recv failure: Connection was reset by peer", which is not the message I am getting. I'm really frustrated that I have to ask for help, because I'd been doing great so far, just learning as I go. This is the first thing I've ever written in PHP.

TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".

Hopefully someone can help me!! :) Thanks for reading even if you can't!

P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.


1 answer

  • doumu2172 2016-05-07 00:22

    After researching this at length (I'd already been researching it for days before posting my question), I've caved in and accepted that this error is probably tied to the site I'm trying to scrape and therefore out of my control.

    I did manage to work around it though, so I'll drop my workaround here...

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, trim($url));
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
    $page = curl_exec($curl);
    if (curl_errno($curl)) {
        echo 'Scraping error: ' . curl_error($curl) . '<br>';
        curl_close($curl); //Release the failed handle before restarting.
        echo 'Dropping table...<br>';
        $sql = "DROP TABLE table_item_info";
        if (!mysqli_query($conn, $sql)) {
            echo "Could not drop table: " . mysqli_error($conn);
        }
        mysqli_close($conn);
        echo "TABLE has been dropped. Restarting.<br>";
        goto start; //Jumps to the "start:" label at the top of the script.
    }
    curl_close($curl);
    

    Basically, I've implemented error checking. If curl_errno($curl) reports an error, I assume it's the connection reset error. In that case, I drop my SQL table and jump back to the start of my script using "goto start". The matching "start:" label sits at the top of my file.
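The check above treats every cURL error as a connection reset. A minimal sketch of a narrower check, assuming the reset surfaces as libcurl's receive error (CURLE_RECV_ERROR, numeric value 56), which as far as I know is the code behind both the "Connection was reset" and "Connection was reset by peer" wordings. The function name isRecvFailure and the fallback constant are hypothetical, not part of the original script.

```php
<?php
// Hypothetical helper (not from the original post): classifies a cURL failure
// by its numeric code instead of by the wording of the error message.

// Fallback for environments where the curl extension's constant is missing.
// 56 is libcurl's documented value for CURLE_RECV_ERROR.
const RECV_ERROR_FALLBACK = 56;

function isRecvFailure(int $errno): bool
{
    $recvError = defined('CURLE_RECV_ERROR') ? CURLE_RECV_ERROR : RECV_ERROR_FALLBACK;
    return $errno === $recvError;
}
```

With this, the condition could read if (isRecvFailure(curl_errno($curl))) so that unrelated failures (a bad hostname, a timeout) are reported instead of silently triggering a restart.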

    This fixed my problem! Now I don't need to worry about whether the connection was reset or not. My code is smart enough to determine that on its own and reset the script if that was the case.
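Restarting the whole script repeats up to 400 requests after a single failure. A different technique, shown here only as a sketch, is to retry just the failed request a bounded number of times with a growing pause. The names fetchWithRetry and curlFetch, the attempt limit, the delay, and the 10 and 30 second timeouts are all assumptions chosen for illustration. Whether the target site tolerates quick retries is also an assumption.

```php
<?php
/**
 * Hypothetical helper: calls $fetch until it reports errno 0 or the attempt
 * limit is reached. $fetch must return an array with the keys
 * 'body', 'errno' and 'error'.
 */
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500): array
{
    $result = ['body' => false, 'errno' => -1, 'error' => 'not attempted', 'attempts' => 0];
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        $result['attempts'] = $attempt;
        if ($result['errno'] === 0) {
            return $result; // Success: stop retrying.
        }
        if ($attempt < $maxAttempts) {
            // Linear back-off: wait longer after each failed attempt.
            usleep($baseDelayMs * 1000 * $attempt);
        }
    }
    return $result; // Still failing: the caller decides what to do.
}

/**
 * Hypothetical wrapper around the cURL calls from the post, with finite
 * timeouts (assumed values) instead of 0 and 1200.
 */
function curlFetch(string $url): array
{
    $curl = curl_init(trim($url));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($curl);
    $result = ['body' => $body, 'errno' => curl_errno($curl), 'error' => curl_error($curl)];
    curl_close($curl);
    return $result;
}
```

Inside the foreach loop this could be called as $result = fetchWithRetry(function () use ($url) { return curlFetch($url); }); and $result['body'] would take the place of $page. Only if $result['errno'] is still non-zero after the last attempt would the script need to fall back to dropping the table and restarting.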

    Hope this helps!
