PHP cURL web-scraper intermittently returns the error "Recv failure: Connection was reset"

I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:

Recv failure: Connection was reset

Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:

$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');

//Some code that creates my SQL table.

//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){

    //3 array variables defined here, which I didn't include in this example.

    $path = $uniqueItems->href;
    $url = 'http://IPAddressOfSiteIAmScraping' . $path;

    //Here's the cURL part - I suspect this is the problem. I am an amateur!
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, trim($url));
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
    $page = curl_exec($curl);

    //This is the part that throws up the connection reset error.
    if(curl_errno($curl)) {
        echo 'Scraping error: ' . curl_error($curl);
        exit;
    }
    curl_close($curl);

    //Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
    $dom = new DOMDocument;
    $dom->encoding = 'utf-8'; //Allows the DOM to display HTML entities for special characters like ö.
    @$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
    $xpath = new DOMXpath($dom); //Allows us to use Xpath values.

    //Xpaths that I've set - this is for the SQL part. Probably irrelevant.
    $header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
    $price = $xpath->query('//tr[@class="price_tr"]/td[2]');
    $currency = $xpath->query('//tr[@class="price_tr"]/td[3]');
    $league = $xpath->query('//td[@class="left-column"]/p[1]');

    //Here we collect specifically the item name from the DOM.
    foreach($header as $e) {
        $temp = new DOMDocument();
        $temp->appendChild($temp->importNode($e,TRUE));
        $val = $temp->saveHTML();
        $val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
        $val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
        $val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
        $final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
        $item['title'] = $final; //Here's the item name, ready for the SQL table.
    }

    //Here's a bunch of code where I write to my SQL table. Again, this part works great!

}

I am not opposed to switching to regex if I need to ditch DOM, but I did three days worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I'm seeing says "Recv failure: Connection was reset by peer", which is not what I am getting. I'm really frustrated that I have to ask for help - I've been doing so great so far - just learning as I go. This is the first thing I've ever written in PHP.

TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".

Hopefully someone can help me!! :) Thanks for reading even if you can't!

P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.


1 answer

doumu2172 2016-05-07 00:22
After researching this at length (I'd already been researching it for days before posting my question), I've caved in and accepted that this error is probably tied to the site I'm trying to scrape and therefore out of my control.

I did manage to work around it though, so I'll drop my workaround here...

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);

if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl) . '<br>';
    echo 'Dropping table...<br>';
    $sql = "DROP TABLE table_item_info";
    if (!mysqli_query($conn, $sql)) {
        echo "Could not drop table: " . mysqli_error($conn);
    }
    mysqli_close($conn);
    echo "TABLE has been dropped. Restarting.<br>";
    goto start; //Jumps back to the "start:" label at the top of the file.
}
curl_close($curl);

Basically, what I've done is implement error-checking. If curl_errno($curl) reports an error, I assume it's the connection reset error. That being the case, I drop my SQL table and then jump back to the start of my script using "goto start". Then, at the top of my file, I have the label "start:".
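
For illustration, the same restart-on-error idea could also be written as a capped loop instead of goto, so the script gives up if the site stays unreachable. This is only a sketch under assumptions: runWithRestarts() and $scrapeOnce are hypothetical names that do not appear in the script, and $scrapeOnce stands in for the whole create-table, scrape and drop-on-error pass.

```php
<?php
// Sketch only: runWithRestarts() and $scrapeOnce are hypothetical names.

/**
 * Runs $scrapeOnce() until it returns true, at most $maxRuns times.
 * $scrapeOnce is assumed to create the table, scrape every page, and
 * return false (after dropping the table) as soon as a cURL error occurs.
 * Returns the number of runs used, or 0 if every run failed.
 */
function runWithRestarts(callable $scrapeOnce, $maxRuns = 5)
{
    for ($run = 1; $run <= $maxRuns; $run++) {
        if ($scrapeOnce()) {
            return $run;   // finished cleanly on this run
        }
        echo "Run $run failed. Restarting.\n";
    }
    return 0;              // gave up: the site may be down rather than flaky
}
```

The cap is the main design difference: a bare "goto start" keeps restarting for as long as the error keeps occurring.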

This fixed my problem! Now I don't need to worry about whether the connection was reset or not. My code is smart enough to determine that on its own and reset the script if that was the case.
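
A lighter variation, which is not what I actually did, would be to retry only the page that failed rather than dropping the table and starting over. The sketch below assumes that a short pause is enough for the remote site to accept connections again. fetchWithRetry(), curlFetch() and CURL_RECV_FAILURE are hypothetical names. The value 56 is libcurl's CURLE_RECV_ERROR code, which is the one behind "Recv failure" messages.

```php
<?php
// Sketch only: fetchWithRetry(), curlFetch() and the constant below are
// hypothetical names, not part of the original script.

const CURL_RECV_FAILURE = 56; // libcurl's CURLE_RECV_ERROR ("Recv failure: ...")

/**
 * Calls $fetch($url) until it succeeds or $maxAttempts is used up.
 * $fetch must return an array: [body string or false, cURL error number].
 * Only a recv failure is retried; any other error stops immediately.
 */
function fetchWithRetry(callable $fetch, $url, $maxAttempts = 3, $pauseSeconds = 2)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        list($body, $errno) = $fetch($url);
        if ($errno === 0) {
            return $body;          // success
        }
        if ($errno !== CURL_RECV_FAILURE) {
            return false;          // a different problem; retrying is unlikely to help
        }
        if ($attempt < $maxAttempts) {
            sleep($pauseSeconds);  // give the remote server a moment
        }
    }
    return false;                  // still failing after all attempts
}

/** A cURL-based fetcher using a subset of the options from the code above. */
function curlFetch($url)
{
    $curl = curl_init(trim($url));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 1200);
    $body  = curl_exec($curl);
    $errno = curl_errno($curl);
    curl_close($curl);
    return array($body, $errno);
}
```

In the scraping loop, $page = fetchWithRetry('curlFetch', $url); would stand in for the curl_init() through curl_close() lines. A false result still has to be handled, for example by falling back to the full restart.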

Hope this helps!
