duangai1941 2019-01-02 08:49
Views: 83
Accepted

Get all the URLs using multi cURL

I'm working on an app that collects all the URLs from an array of sites and displays them as an array or as JSON.

I can do it with a for loop, but the problem is execution time: when I tried 10 URLs, I got an error saying the maximum execution time was exceeded.

While searching, I found multi cURL.

I also found "Fast PHP CURL Multiple Requests: Retrieve the content of multiple URLs using CURL". I tried to add my code to it, but it didn't work because I don't know how to use the function.

I hope you can help me.

Thanks.

This is my sample code.

<?php

$urls=array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/');


$mh = curl_multi_init();
foreach ($urls as $i => $url) {

        $urlContent = file_get_contents($url);

        $dom = new DOMDocument();
        @$dom->loadHTML($urlContent);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        for($i = 0; $i < $hrefs->length; $i++){
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
            $url = filter_var($url, FILTER_SANITIZE_URL);
            // validate url
            if(!filter_var($url, FILTER_VALIDATE_URL) === false){
                echo '<a href="'.$url.'">'.$url.'</a><br />';
            }
        }

        $conn[$i]=curl_init($url);
        $fp[$i]=fopen ($g, "w");
        curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
        curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
        curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60);
        curl_multi_add_handle ($mh,$conn[$i]);
}
do {
    $n=curl_multi_exec($mh,$active);
}
while ($active);
foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh,$conn[$i]);
    curl_close($conn[$i]);
    fclose ($fp[$i]);
}
curl_multi_close($mh);
?>

6 answers

  • down101102 2019-01-09 07:24

    Here is a function I put together that properly uses curl_multi_init(). It is more or less the same function you will find on PHP.net, with some minor tweaks. I have had great success with it.

    function multi_thread_curl($urlArray, $optionArray, $nThreads) {

      //Start with an empty result set so an empty url list still returns an array.
      $results = array();

      //Group your urls into batches of at most $nThreads, keeping the original keys.
      $batches = array_chunk($urlArray, $nThreads, true);

      //Iterate through each batch of urls.
      foreach ($batches as $batch) {

          //Create the multiple cURL handler.
          $mh = curl_multi_init();
          $handles = array();

          //Create your cURL resources and add them to the multi handle.
          foreach ($batch as $key => $url) {

              $handles[$key] = curl_init();
              curl_setopt_array($handles[$key], $optionArray); //Set your main curl options.
              curl_setopt($handles[$key], CURLOPT_URL, $url); //Set url.
              curl_multi_add_handle($mh, $handles[$key]);

          }

          //Execute the handles until every transfer in the batch has finished.
          $active = null;
          do {

              $mrc = curl_multi_exec($mh, $active);

              //Wait for activity instead of spinning; sleep briefly if select fails.
              if ($active && curl_multi_select($mh) === -1) {
                  usleep(100000);
              }

          } while ($active && $mrc == CURLM_OK);

          //Get your data and detach the handles.
          foreach ($handles as $key => $handle) {

              $results[$key] = curl_multi_getcontent($handle);
              curl_multi_remove_handle($mh, $handle);

          }

          //Close the multi handle.
          curl_multi_close($mh);

      }

      return $results;

    }
    
    
    
    //Add whatever options here. The CURLOPT_URL is left out intentionally.
    //It will be added in later from the url array.
    $optionArray = array(
    
      CURLOPT_USERAGENT        => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',//Pick your user agent.
      CURLOPT_RETURNTRANSFER   => TRUE,
      CURLOPT_TIMEOUT          => 10
    
    );
    
    //Create an array of your urls.
    $urlArray = array(
    
        'http://site1.com/',
        'http://site2.com/',
        'http://site3.com/'
    
    );
    
    //Play around with this number and see what works best.
    //This is how many urls it will try to do at one time.
    $nThreads = 20;
    
    //To use run the function.
    $results = multi_thread_curl($urlArray, $optionArray, $nThreads);
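
    To see how $nThreads turns the URL list into batches, here is a small sketch of the array_chunk() call on its own. The chunk_urls() helper name and the batch size of 2 are made up for illustration; the function above does the same thing inline.

    ```php
    <?php
    // Hypothetical helper: split a url list into batches of at most $nThreads,
    // keeping the original keys so each result can be matched back to its url.
    function chunk_urls(array $urlArray, int $nThreads): array {
        return array_chunk($urlArray, $nThreads, true);
    }

    $batches = chunk_urls(
        array('http://site1.com/', 'http://site2.com/', 'http://site3.com/'),
        2
    );

    // Three urls with a batch size of 2 give two batches: keys 0 and 1, then key 2.
    print_r($batches);
    ```

    Because the keys are preserved, the third URL keeps key 2 in the second batch, so the results of later batches do not overwrite those of earlier ones.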
    

    Once this is complete you will have an array containing the HTML from each site in your list. At this point I would loop through the pages and pull out all of the URLs.

    Like so:

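    foreach($results as $page){

      //Skip pages that failed to download; loadHTML() rejects an empty string.
      if (trim((string) $page) === '') {
        continue;
      }

      $dom = new DOMDocument();
      @$dom->loadHTML($page); //Suppress warnings caused by invalid markup.
      $xpath = new DOMXPath($dom);
      $hrefs = $xpath->evaluate("/html/body//a");

      foreach($hrefs as $href){

        $url = filter_var($href->getAttribute('href'), FILTER_SANITIZE_URL);
        // validate url
        if(filter_var($url, FILTER_VALIDATE_URL) !== false){
          //Escape the url before printing it into HTML.
          $safeUrl = htmlspecialchars($url, ENT_QUOTES);
          echo '<a href="'.$safeUrl.'">'.$safeUrl.'</a><br />';
        }

      }

    }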
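
    Since the question asks for the links as an array or as JSON rather than as printed HTML, the same loop can collect the URLs instead of echoing them. This is only a sketch: extract_links() and $allLinks are hypothetical names, and the hard-coded $results array stands in for what multi_thread_curl() would return.

    ```php
    <?php
    // Hypothetical helper: return every absolute url found in the <a> tags of one page.
    function extract_links(string $html): array {
        $links = array();
        if (trim($html) === '') {
            return $links; // Nothing was downloaded for this page.
        }

        $dom = new DOMDocument();
        @$dom->loadHTML($html); // Suppress warnings caused by invalid markup.
        $xpath = new DOMXPath($dom);

        foreach ($xpath->query('//a[@href]') as $anchor) {
            $url = filter_var($anchor->getAttribute('href'), FILTER_SANITIZE_URL);
            if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
                $links[] = $url;
            }
        }

        // Drop duplicates and reindex so json_encode() emits a plain list.
        return array_values(array_unique($links));
    }

    // Stand-in for the $results array returned by multi_thread_curl().
    $results = array(
        '<html><body><a href="http://site1.com/about">About</a><a href="/contact">Contact</a></body></html>',
        '',
    );

    $allLinks = array();
    foreach ($results as $key => $page) {
        $allLinks[$key] = extract_links((string) $page);
    }

    echo json_encode($allLinks, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT), "\n";
    ```

    Note that FILTER_VALIDATE_URL only accepts absolute URLs, so relative links such as /contact are dropped here, just as they are in the loop above.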
    

    It is also worth keeping in mind that you can increase the maximum run time of your script. This is done by:

    ini_set('max_execution_time', 120);

    If you're using a hosting service, you may be restricted to something in the ballpark of two minutes regardless of what you set your max execution time to. Just food for thought.

    You can always try more time, but you'll never know until you time it.
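
    One simple way to time it is to wrap the call in microtime(). The sketch below is an illustration only: time_call() is a hypothetical helper, and usleep() stands in for the real work, such as a call to multi_thread_curl().

    ```php
    <?php
    // Hypothetical helper: run a callable and return its result with the elapsed seconds.
    function time_call(callable $fn): array {
        $start = microtime(true);
        $result = $fn();
        return array($result, microtime(true) - $start);
    }

    // usleep() stands in for the real work, e.g. a call to multi_thread_curl().
    list($result, $seconds) = time_call(function () {
        usleep(50000);
        return 'done';
    });

    printf("Finished in %.3f seconds\n", $seconds);
    ```

    Running this with different values of $nThreads should show which batch size works best for your list of sites.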

    Hope it helps.

    This answer was selected by the asker as the best answer.