douyu5679 2014-05-22 16:51
Views: 71

PHP cURL - why does the script die after the 36th request to a remote URL?

I am trying to scrape a website's pages to get certain text content. New pages are always being added, so I want to be able to just increment through each page (using a fixed format URL) until I get a 404.

Pages are in a fixed format: a constant prefix, then an incrementing page number, then a constant suffix (assembled in the code below as $preURL.$currentPage.$postURL).
Everything runs smoothly until it hits the 36th page, then the script just dies (it doesn't even reach the 404 test case). I know that there are about 100 pages that exist in this example, and I can manually view them all without a problem. Also, there is no error on the 36th page.

Test case: I tried looping through 50 times and had no problem with the cURL recursion itself, so it seems to be the website I actually want to cURL, or something with my server.

It seems to be some sort of limit, either on the remote server or on my server: I can run this page over and over again with no delay, and it always reads exactly 36 pages before it dies.

Can remote servers limit cURL requests? Are there any other timeouts I need to raise? Could it be a server memory issue?
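One way to see why it "just dies" is to ask cURL itself for the failure reason right after curl_exec(); a minimal debugging sketch, using the same shared $curl handle as the function below:

    $content = curl_exec($curl);
    if($content === false) {
        // curl_errno()/curl_error() surface timeouts, connection resets, etc.
        echo "cURL error #".curl_errno($curl).": ".curl_error($curl)."<br>";
    }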

**Recursive scraping function:** (The $curl handle is created for the first call, then just passed by reference on each recursion. I read this was better than creating and closing large numbers of cURL handles.)

function scrapeSite(&$curl, $preURL, $postURL, $parameters, $currentPage){
    //Format URL
    $formattedURL = $preURL.$currentPage.$postURL;
    echo "Formatted URL: ".$formattedURL."<br>";
    echo "Count: ".$currentPage."<br>";
    //Point the shared cURL handle at the new URL
    curl_setopt($curl, CURLOPT_URL, $formattedURL);

    //Disable PHP's script execution time limit
    set_time_limit(0);

    //Check for 404 (note: curl_exec() has not run yet for this URL, so this
    //reads the HTTP code left over from the previous transfer on the handle)
    $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if($httpCode == 404 || $currentPage == 50) {
        return 'PAGE NOT FOUND<br>';
    }

    //Set other cURL options
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 400); //timeout in seconds

    $content = curl_exec($curl);
    //str_get_html() is from the Simple HTML DOM parser library
    $html = str_get_html($content);
    echo "Parameter Check: ".is_array($html->find($parameters))."<br>";
    if(is_array($html->find($parameters))) {
        foreach($html->find($parameters) as $element) {
            echo "Text: ".$element->plaintext."<br>";
        }
        return scrapeSite($curl, $preURL, $postURL, $parameters, $currentPage + 1);
    } else {
        echo "No Elements Found";
    }
}

1 answer

  • douzi5214 2014-05-22 23:05

    Maybe it's just a memory limit problem. Try this (at the top of the script):
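    Presumably something along these lines, raising the limit via ini_set(); the exact value is an assumption:

        // lift PHP's memory limit for this script (-1 = unlimited; or e.g. '256M')
        ini_set('memory_limit', '-1');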


    Also, you said "... or something with my server", so if you can, just read your server logs...
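    While debugging, it also helps to make PHP report everything; a standard sketch:

        // report and display all PHP errors while debugging
        error_reporting(E_ALL);
        ini_set('display_errors', '1');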

    This answer was accepted by the asker.


