PHP cURL web scraper intermittently returns the error "Recv failure: Connection was reset"

I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:

Recv failure: Connection was reset

Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:

$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');

//Some code that creates my SQL table.

//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){

   //3 array variables defined here, which I didn't include in this example.

   $path = $uniqueItems->href;
   $url = 'http://IPAddressOfSiteIAmScraping' . $path;

//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);

//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl);
    exit; }
curl_close($curl);

//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Allows the DOM to display html entities for special characters like ö.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.

//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]'); 
$league = $xpath->query('//td[@class="left-column"]/p[1]');

//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
    $temp = new DOMDocument();
    $temp->appendChild($temp->importNode($e,TRUE));
    $val = $temp->saveHTML();
    $val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
    $val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
    $val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
    $final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
    $item['title'] = $final; //Here's the item name, ready for the SQL table.
}

//Here's a bunch of code where I write to my SQL table. Again, this part works great!

}

I am not opposed to switching to regex if I need to ditch DOM, but I did three days worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I'm seeing says "Recv failure: Connection was reset by peer", which is not what I am getting. I'm really frustrated that I have to ask for help - I've been doing so great so far - just learning as I go. This is the first thing I've ever written in PHP.

TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".

Hopefully someone can help me!! :) Thanks for reading even if you can't!

P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.

Answer by doumu2172, 2016-05-07 00:22
After researching this at length (I'd already been researching it for days before posting my question), I've caved in and accepted that this error is probably tied to the site I'm trying to scrape and therefore out of my control.

I did manage to work around it though, so I'll drop my workaround here...

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);

if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl) . '</br>';
    echo 'Dropping table...</br>';
    $sql = "DROP TABLE table_item_info";
    if (!mysqli_query($conn, $sql)) {
        echo "Could not drop table: " . mysqli_error($conn);
    }
    mysqli_close($conn);
    echo "TABLE has been dropped. Restarting.</br>";
    goto start;
    exit;
}
curl_close($curl);

Basically, what I've done is implemented error-checking. If the error comes up under curl_errno($curl), I assume it's the connection reset error. That being the case, I drop my SQL table and then jump back to the start of my script using "goto start". Then, at the top of my file I have "start:"
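The check described above treats every cURL error as a connection reset. As a rough sketch only, the restart could be limited to the error that was actually observed: libcurl reports "Recv failure" as error number 56 (CURLE_RECV_ERROR). The helper name isRetryableCurlError and the constant RECV_ERROR are made up for this illustration and are not part of the original script.

```php
<?php
// libcurl's error number for "Recv failure" (CURLE_RECV_ERROR).
// Declared as a literal so the sketch also runs without the curl extension.
const RECV_ERROR = 56;

// Hypothetical helper: true only for the error this thread is about.
function isRetryableCurlError(int $errno): bool
{
    return $errno === RECV_ERROR;
}
```

In the workaround this could guard the restart, for example `if (isRetryableCurlError(curl_errno($curl))) { ... goto start; }`, so that other failures (a timeout, a bad URL) are reported instead of triggering an endless restart.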

This fixed my problem! Now I don't need to worry about whether the connection was reset or not. My code is smart enough to determine that on its own and reset the script if that was the case.
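A lighter variant of the same idea, sketched here under the assumption that the reset is transient: retry only the page that failed, instead of dropping the table and restarting the whole run. The names fetchWithRetry and curlFetcher, the attempt count, and the delay are all hypothetical choices for this example, not something from the original script.

```php
<?php
/**
 * Hypothetical helper: call $fetch until it succeeds or $maxAttempts is used up.
 * $fetch must return an array of [body or false, cURL error number, error text].
 */
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $delayMs = 500): string
{
    $error = 'no attempts made';
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        [$body, $errno, $error] = $fetch();
        if ($errno === 0 && $body !== false) {
            return $body; // success, stop retrying
        }
        if ($attempt < $maxAttempts) {
            // Wait a little longer after each failed attempt.
            usleep($delayMs * 1000 * $attempt);
        }
    }
    throw new RuntimeException("Giving up after $maxAttempts attempts: $error");
}

// Wraps the cURL calls from the workaround so they can be retried as a unit.
function curlFetcher(string $url): callable
{
    return function () use ($url): array {
        $curl = curl_init(trim($url));
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_TIMEOUT, 1200);
        $body = curl_exec($curl);
        // The handle is released when $curl goes out of scope.
        return [$body, curl_errno($curl), curl_error($curl)];
    };
}
```

Inside the foreach loop this would be used as `$page = fetchWithRetry(curlFetcher($url));`. Rows already written to the table stay in place, and the script only stops (with an exception) if the same page fails several times in a row.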

Hope this helps!