I am trying to scrape a website's pages to get certain text content. New pages are always being added, so I want to be able to just increment through each page (using a fixed-format URL) until I get a 404.
Pages are in this format:
http://thesite.com/page-1.html
http://thesite.com/page-2.html
http://thesite.com/page-3.html
...etc....
Everything runs smoothly until it hits the 36th page, then the script just dies (it doesn't even hit the 404 test case). I know there are roughly 100 pages in this example, I can view them all manually without a problem, and there is nothing wrong with the 36th page itself.
Test case: I tried looping through http://google.com 50 times and the cURL recursion had no problem, so it seems to be either the site I actually want to cURL or something on my server.
It looks like some sort of limit on either the remote server or my own, because I can run this page over and over with no delay between runs and it always reads exactly 36 pages before dying.
Can remote servers set a limit on cURL requests? Are there any other timeouts I need to increase? Could it be a server memory issue?
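To help narrow it down, I was thinking of dumping cURL's own error state and the script's memory usage after each request, so I can tell a remote-side refusal apart from a local memory or timeout problem. A minimal sketch (dumpRequestState is just a made-up helper name; the functions it calls are standard PHP):

//Hypothetical helper: dump cURL's error state and PHP memory usage
//after a request, to separate remote refusals from local resource issues.
function dumpRequestState($curl, $currentPage){
    echo "Page ".$currentPage
        ." - HTTP ".curl_getinfo($curl, CURLINFO_HTTP_CODE)
        ." - cURL errno ".curl_errno($curl)." (".curl_error($curl).")"
        ." - memory ".memory_get_usage(true)." bytes<br>";
}

If the memory number climbs steadily towards memory_limit in php.ini, that would point at my server rather than the remote one.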
**Recursive scraping function** (the $curl handle is created in the first call and then passed by reference; I read this is better than creating and closing large numbers of cURL handles). str_get_html() comes from the Simple HTML DOM parser; a sketch of how the function is first called is included after the listing.
function scrapeSite(&$curl, $preURL, $postURL, $parameters, $currentPage){
    //Format the URL for the current page
    $formattedURL = $preURL.$currentPage.$postURL;
    echo "Formatted URL: ".$formattedURL."<br>";
    echo "Count: ".$currentPage."<br>";

    //Point the shared cURL handle at the current page
    curl_setopt($curl, CURLOPT_URL, $formattedURL);

    //Remove the PHP execution time limit
    set_time_limit(0);

    //Check for a 404, or stop once page 50 is reached
    $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if($httpCode == 404 || $currentPage == 50) {
        curl_close($curl);
        return 'PAGE NOT FOUND<br>';
    }

    //Set the other cURL options
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 400); //timeout in seconds

    //Fetch the page and parse it with Simple HTML DOM
    $content = curl_exec($curl);
    $html = str_get_html($content);

    echo "Parameter Check: ".is_array($html->find($parameters))."<br>";
    if(is_array($html->find($parameters)) > 0){
        foreach($html->find($parameters) as $element) {
            echo "Text: ".$element->plaintext."<br>";
        }
        //Recurse into the next page
        return scrapeSite($curl, $preURL, $postURL, $parameters, $currentPage + 1);
    } else {
        echo "No Elements Found";
    }
}
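For reference, the first call looks roughly like this (the "div.content" selector is just a placeholder; the URL pieces match the page format above):

//Rough sketch of the initial call: the cURL handle is created once here
//and then passed by reference through every recursive call.
$curl = curl_init();
echo scrapeSite($curl, "http://thesite.com/page-", ".html", "div.content", 1);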