dtdd25012 2012-02-10 19:07
Accepted

Downloading pages in parallel with PHP

I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes something like this.

I fetch a base URL and get all the secondary URLs from that page. Then, for each secondary URL, I fetch it, process the page, download some photos (which takes quite a long time), and store the data in a database. Then I fetch the next URL and repeat the process.

In this process, I think I am wasting time fetching the secondary URL at the start of each iteration, so I am trying to fetch the next URLs in parallel while the first iteration is still being processed.

The solution I have in mind is to call a separate PHP script from the main process, say a downloader, which will download all the URLs (with curl_multi or wget) and store them in some database.
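
For example, this is roughly how I imagine firing off such a downloader without blocking the main script. This is only a sketch: the downloader.php name and the URL-list file are placeholders, and the trailing & with output redirection only works on Unix-like systems.

    // Write the secondary URLs to a temp file and hand it to a background downloader.
    $urls = array('http://example.com/page1', 'http://example.com/page2');
    $urlFile = tempnam(sys_get_temp_dir(), 'urls_');
    file_put_contents($urlFile, implode("\n", $urls));

    // Redirect output and detach with & so exec() returns immediately (Unix-like shells only).
    exec('php downloader.php ' . escapeshellarg($urlFile) . ' > /dev/null 2>&1 &');

    // ...keep processing the current page while downloader.php fetches the rest...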

My questions are:

  • How do I call such a downloader asynchronously? I don't want my main script to wait until the downloader completes.
  • Where should I store the downloaded data, such as in shared memory? Anywhere other than the database, of course.
  • Is there any chance the data gets corrupted while being stored and retrieved, and how can I avoid that?
  • Also, please let me know if anyone has a better plan.

5 Answers

  • dougai3418 2012-02-10 21:06

    When I hear that someone uses curl_multi_exec, it usually turns out that they just load it with, say, 100 URLs, wait until they all complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing that too, but then I found out that it is possible to remove and add handles to curl_multi while transfers are still in progress, and it really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main part to give you the general idea:

    public function launch() {
        $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
        $activeJobs = array();
        $running = 0;
        $mrc = CURLM_OK;
        do {
            // pick jobs for free channels:
            while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
                // take a free channel, (re)init its curl handle and let the
                // queued job object set its options
                reset($freeChannels); // don't rely on the internal pointer after unset()
                $chId = key($freeChannels);
                if (empty($channels[$chId])) {
                    $channels[$chId] = curl_init();
                }
                $job = array_pop($this->jobQueue);
                $job->init($channels[$chId]);
                curl_multi_add_handle($this->master, $channels[$chId]);
                $activeJobs[$chId] = $job;
                unset($freeChannels[$chId]);
            }
            $pending = count($activeJobs);

            // launch them:
            if ($pending > 0) {
                // poke it while it wants:
                while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
                // wait for some activity, don't eat CPU:
                curl_multi_select($this->master);
                while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                    // some connection(s) finished, locate that job and run its response handler:
                    $pending--;
                    $chId = array_search($info['handle'], $channels);
                    $content = curl_multi_getcontent($channels[$chId]);
                    curl_multi_remove_handle($this->master, $channels[$chId]);
                    $freeChannels[$chId] = NULL; // free up this channel
                    if ( !array_key_exists($chId, $activeJobs) ) {
                        // impossible, but...
                        continue;
                    }
                    $activeJobs[$chId]->onComplete($content);
                    unset($activeJobs[$chId]);
                }
            }
        } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
    }
    

    In my version, jobs are actually instances of a separate class, not controllers or models. They just handle setting the cURL options, parsing the response, and calling a given onComplete callback. With this structure, new requests start as soon as something in the pool finishes.
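
    As an illustration only, here is a minimal sketch of what one of those job objects could look like. Only the init() and onComplete() method names come from the snippet above; the PageJob class name, its constructor, and the callback signature are my assumptions:

    // Sketch of a hypothetical job object for the queue above.
    class PageJob {
        private $url;
        private $callback;

        public function __construct($url, $callback) {
            $this->url = $url;
            $this->callback = $callback; // called with the response body
        }

        // Called by launch(): configure the (re)used cURL handle for this request.
        public function init($ch) {
            curl_setopt_array($ch, array(
                CURLOPT_URL            => $this->url,
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT        => 30,
            ));
        }

        // Called by launch() when the transfer finishes: hand the content off.
        public function onComplete($content) {
            call_user_func($this->callback, $this->url, $content);
        }
    }

    Jobs like this would simply be pushed onto $this->jobQueue before launch() is called.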

    Of course, it doesn't really help if the processing takes a long time too, not just the retrieving... and it isn't true parallel handling. But I still hope it helps. :)

    P.S. It did the trick for me. :) A job that once took 8 hours now completes in 3-4 minutes using a pool of 50 connections. I can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP things rarely work exactly as they're supposed to... It was like: "OK, I hope it at least finishes within an hour... Wha... Wait... Already?! 8-O"

