douyu5679 2014-05-22 16:51
71 views
Accepted

PHP cURL - why does my script die after the 36th request to a remote URL?

I am trying to scrape a website's pages to get certain text content. New pages are always being added, so I want to be able to just increment through each page (using a fixed format URL) until I get a 404.

Pages are in this format:

http://thesite.com/page-1.html

http://thesite.com/page-2.html

http://thesite.com/page-3.html

...etc....

Everything runs smoothly until it hits the 36th page, then just dies (doesn't even hit the 404 test case). I know that there are about 100 pages that exist in this example, and I can manually view them all without a problem. Also, there is no error on the 36th page.

Test case - I tried looping through http://google.com 50 times and had no problem with the cURL recursion. It seems specific to the website I actually want to cURL, or to something on my server.

It seems to be some sort of limit either on the remote server, or my server, as I can run this page over and over again with no delay and I always get 36 pages read before it dies.

Can remote servers set a limit on cURL requests? Are there any other timeouts I need to increase? Is it a possible server memory issue?

**Recursive scraping function:** (The $curl handle is created in the first call to the function, then just passed by reference. I read this was better than creating and closing large numbers of cURL handles.)

    function scrapeSite(&$curl, $preURL, $postURL, $parameters, $currentPage){
        //Format URL
        $formattedURL = $preURL.$currentPage.$postURL;
        echo "Formatted URL: ".$formattedURL."<br>";
        echo "Count: ".$currentPage."<br>";
        //Point the shared cURL handle at the new URL
        curl_setopt($curl, CURLOPT_URL, $formattedURL);

        //Set PHP timeout to unlimited
        set_time_limit(0);
        //Check for 404 (note: on a reused handle this reads the status of the previous request)
        $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if($httpCode == 404 || $currentPage == 50) {
            curl_close($curl);
            return 'PAGE NOT FOUND<br>';
        }
        //Set other cURL options
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
        curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
        curl_setopt($curl, CURLOPT_TIMEOUT, 400); //timeout in seconds
        $content = curl_exec($curl);
        $html = str_get_html($content); //simple_html_dom parser
        $elements = $html->find($parameters);
        echo "Parameter Check: ".count($elements)."<br>";
        if(count($elements) > 0){
            foreach($elements as $element) {
                echo "Text: ".$element->plaintext."<br>";
            }
            return scrapeSite($curl, $preURL, $postURL, $parameters, $currentPage + 1);
        }else{
            echo "No Elements Found";
        }
    }

1 answer

  • douzi5214 2014-05-22 23:05

    Maybe it's just a memory limit problem. Try this (at the top of the script):

    ini_set("memory_limit",-1);
    

    You also said "... or something with my server", so if you can, check your server's error logs.
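
    A minimal sketch of how these suggestions might fit together. This is an assumption, not a confirmed fix: it raises the memory limit, replaces the recursion with a loop (so the call stack stops growing), and calls simple_html_dom's clear() to release the parser's memory between pages, since that library is known to hold a lot of memory per document. The function name scrapeSiteIterative is made up for this sketch.

    // Sketch only - assumes the simple_html_dom library from the
    // question is loaded (str_get_html() and $html->clear()).
    ini_set("memory_limit", -1);
    set_time_limit(0);

    function scrapeSiteIterative($preURL, $postURL, $parameters, $maxPage = 50){
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

        for ($page = 1; $page <= $maxPage; $page++) {
            curl_setopt($curl, CURLOPT_URL, $preURL.$page.$postURL);
            $content = curl_exec($curl);
            // Check the status of *this* request, after curl_exec()
            if (curl_getinfo($curl, CURLINFO_HTTP_CODE) == 404) {
                echo 'PAGE NOT FOUND<br>';
                break;
            }
            $html = str_get_html($content);
            foreach ($html->find($parameters) as $element) {
                echo "Text: ".$element->plaintext."<br>";
            }
            $html->clear();   // free simple_html_dom's internal references
            unset($html);
        }
        curl_close($curl);
    }

    Moving the curl_getinfo() call after curl_exec() also means the 404 test sees the current page's status rather than the previous request's.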

    This answer was selected as the best answer by the asker.
