dtdd25012 2012-02-10 19:07
Accepted

Downloading pages in parallel with PHP

I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes something like this.

I fetch a base URL and get all the secondary URLs from that page. Then, for each secondary URL, I fetch it, process the page, download some photos (which takes quite a long time), and store the data in a database. Then I fetch the next URL and repeat the process.

In this process, I think I am wasting time fetching the secondary URL at the start of each iteration, so I am trying to fetch the next URLs in parallel while the first iteration is still being processed.

The solution I have in mind is to call a separate PHP script from the main process, say a downloader, which will download all the URLs (with curl_multi or wget) and store them in some database.
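
For example, this is roughly how I imagine firing off such a downloader without blocking the main script. This is only a sketch: the downloader.php name and the URL-list file are placeholders, and the trailing & with output redirection only works on Unix-like systems.

    // Write the secondary URLs to a temp file and hand it to a background downloader.
    $urls = array('http://example.com/page1', 'http://example.com/page2');
    $urlFile = tempnam(sys_get_temp_dir(), 'urls_');
    file_put_contents($urlFile, implode("\n", $urls));

    // Redirect output and detach with & so exec() returns immediately (Unix-like shells only).
    exec('php downloader.php ' . escapeshellarg($urlFile) . ' > /dev/null 2>&1 &');

    // ...keep processing the current page while downloader.php fetches the rest...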

My questions are:

  • How do I call such a downloader asynchronously? I don't want my main script to wait until the downloader completes.
  • Where should I store the downloaded data, such as in shared memory? Anywhere other than the database, of course.
  • Is there any chance the data gets corrupted while being stored and retrieved, and how can I avoid that?
  • Also, please let me know if anyone has a better plan.

5 Answers

  • dougai3418 2012-02-10 21:06

    When I hear that someone uses curl_multi_exec, it usually turns out that they just load it with, say, 100 URLs, wait until they all complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing that too, but then I found out that it is possible to remove and add handles to curl_multi while transfers are still in progress, and it really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main part to give you the general idea:

    public function launch() {
        $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
        $activeJobs = array();
        $running = 0;
        $mrc = CURLM_OK;
        do {
            // pick jobs for free channels:
            while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
                // take a free channel, (re)init its curl handle and let the
                // queued job object set its options
                reset($freeChannels); // don't rely on the internal pointer after unset()
                $chId = key($freeChannels);
                if (empty($channels[$chId])) {
                    $channels[$chId] = curl_init();
                }
                $job = array_pop($this->jobQueue);
                $job->init($channels[$chId]);
                curl_multi_add_handle($this->master, $channels[$chId]);
                $activeJobs[$chId] = $job;
                unset($freeChannels[$chId]);
            }
            $pending = count($activeJobs);

            // launch them:
            if ($pending > 0) {
                // poke it while it wants:
                while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
                // wait for some activity, don't eat CPU:
                curl_multi_select($this->master);
                while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                    // some connection(s) finished, locate that job and run its response handler:
                    $pending--;
                    $chId = array_search($info['handle'], $channels);
                    $content = curl_multi_getcontent($channels[$chId]);
                    curl_multi_remove_handle($this->master, $channels[$chId]);
                    $freeChannels[$chId] = NULL; // free up this channel
                    if ( !array_key_exists($chId, $activeJobs) ) {
                        // impossible, but...
                        continue;
                    }
                    $activeJobs[$chId]->onComplete($content);
                    unset($activeJobs[$chId]);
                }
            }
        } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
    }
    

    In my version, jobs are actually instances of a separate class, not controllers or models. They just handle setting the cURL options, parsing the response, and calling a given onComplete callback. With this structure, new requests start as soon as something in the pool finishes.
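
    As an illustration only, here is a minimal sketch of what one of those job objects could look like. Only the init() and onComplete() method names come from the snippet above; the PageJob class name, its constructor, and the callback signature are my assumptions:

    // Sketch of a hypothetical job object for the queue above.
    class PageJob {
        private $url;
        private $callback;

        public function __construct($url, $callback) {
            $this->url = $url;
            $this->callback = $callback; // called with the response body
        }

        // Called by launch(): configure the (re)used cURL handle for this request.
        public function init($ch) {
            curl_setopt_array($ch, array(
                CURLOPT_URL            => $this->url,
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT        => 30,
            ));
        }

        // Called by launch() when the transfer finishes: hand the content off.
        public function onComplete($content) {
            call_user_func($this->callback, $this->url, $content);
        }
    }

    Jobs like this would simply be pushed onto $this->jobQueue before launch() is called.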

    Of course, it doesn't really help if the processing takes a long time too, not just the retrieving... and it isn't true parallel handling. But I still hope it helps. :)

    P.S. It did the trick for me. :) A job that once took 8 hours now completes in 3-4 minutes using a pool of 50 connections. I can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP things rarely work exactly as they're supposed to... It was like: "OK, I hope it at least finishes within an hour... Wha... Wait... Already?! 8-O"

