douyu5679 2014-05-22 16:51
Views: 71

PHP cURL - why does the script die after the 36th request to a remote URL?

I am trying to scrape a website's pages to get certain text content. New pages are always being added, so I want to be able to just increment through each page (using a fixed format URL) until I get a 404.

Pages are in a fixed format: a constant prefix, then an incrementing page number, then a constant suffix (assembled in the code below as $preURL.$currentPage.$postURL).
Everything runs smoothly until it hits the 36th page, then the script just dies (it doesn't even reach the 404 test case). I know that there are about 100 pages that exist in this example, and I can manually view them all without a problem. Also, there is no error on the 36th page.

Test case: I tried looping through 50 times and had no problem with the cURL recursion itself, so it seems to be the website I actually want to cURL, or something with my server.

It seems to be some sort of limit, either on the remote server or on my server: I can run this page over and over again with no delay, and it always reads exactly 36 pages before it dies.

Can remote servers limit cURL requests? Are there any other timeouts I need to raise? Could it be a server memory issue?
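One way to see why it "just dies" is to ask cURL itself for the failure reason right after curl_exec(); a minimal debugging sketch, using the same shared $curl handle as the function below:

    $content = curl_exec($curl);
    if($content === false) {
        // curl_errno()/curl_error() surface timeouts, connection resets, etc.
        echo "cURL error #".curl_errno($curl).": ".curl_error($curl)."<br>";
    }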

**Recursive scraping function:** (The $curl handle is created for the first call, then just passed by reference on each recursion. I read this was better than creating and closing large numbers of cURL handles.)

function scrapeSite(&$curl, $preURL, $postURL, $parameters, $currentPage){
    //Format URL
    $formattedURL = $preURL.$currentPage.$postURL;
    echo "Formatted URL: ".$formattedURL."<br>";
    echo "Count: ".$currentPage."<br>";
    //Point the shared cURL handle at the new URL
    curl_setopt($curl, CURLOPT_URL, $formattedURL);

    //Disable PHP's script execution time limit
    set_time_limit(0);

    //Check for 404 (note: curl_exec() has not run yet for this URL, so this
    //reads the HTTP code left over from the previous transfer on the handle)
    $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if($httpCode == 404 || $currentPage == 50) {
        return 'PAGE NOT FOUND<br>';
    }

    //Set other cURL options
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 400); //timeout in seconds

    $content = curl_exec($curl);
    //str_get_html() is from the Simple HTML DOM parser library
    $html = str_get_html($content);
    echo "Parameter Check: ".is_array($html->find($parameters))."<br>";
    if(is_array($html->find($parameters))) {
        foreach($html->find($parameters) as $element) {
            echo "Text: ".$element->plaintext."<br>";
        }
        return scrapeSite($curl, $preURL, $postURL, $parameters, $currentPage + 1);
    } else {
        echo "No Elements Found";
    }
}

1 answer

  • douzi5214 2014-05-22 23:05

    Maybe it's just a memory limit problem. Try this (at the top of the script):
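    Presumably something along these lines, raising the limit via ini_set(); the exact value is an assumption:

        // lift PHP's memory limit for this script (-1 = unlimited; or e.g. '256M')
        ini_set('memory_limit', '-1');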


    Also, you said "... or something with my server", so if you can, just read your server logs...
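    While debugging, it also helps to make PHP report everything; a standard sketch:

        // report and display all PHP errors while debugging
        error_reporting(E_ALL);
        ini_set('display_errors', '1');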

    This answer was accepted by the asker.


