dphw5101 2013-02-12 16:31

Scraping with curl/PHP via per-minute cron jobs on shared hosting

I have a tricky problem. I am on basic shared hosting, and I have written a solid scraping script using curl and PHP.

Because multi-threading with curl is not true multi-threading, and even the best curl multi-handle scripts I have used only speed up scraping by a factor of 1.5-2, I came to the conclusion that I need to run a massive number of cron tasks (around 50) per minute against my PHP script, which works with a MySQL table, in order to offer fast web scraping to my customers. A sketch of the multi-handle approach follows.
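For reference, a minimal curl_multi sketch of the multi-handle batching mentioned above; the URL list and option values are placeholders, not taken from the original script:

    <?php
    // Minimal curl_multi sketch: fetch a batch of URLs concurrently.
    // $urls is a placeholder list, not the real scraping queue.
    $urls = ['https://example.com/page1', 'https://example.com/page2'];

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // keep one slow URL from stalling the batch
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for socket activity instead of busy-looping
        }
    } while ($active && $status === CURLM_OK);

    foreach ($handles as $ch) {
        $html = curl_multi_getcontent($ch); // downloaded page, ready for parsing
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);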

My problem is that I get a "MySQL server has gone away" error when lots of cron tasks run at the same time. If I decrease the number of cron tasks, it keeps working, but it is always slow.
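"MySQL server has gone away" usually means the server dropped the connection: an idle link exceeded wait_timeout, a query exceeded max_allowed_packet, or the server hit its connection limit. A minimal sketch of a defensive reconnect in a long-running worker, assuming mysqli and hypothetical credentials:

    <?php
    // Reconnect sketch for a long-running cron worker (hypothetical credentials).
    function connect_db(): mysqli
    {
        $db = new mysqli('localhost', 'user', 'pass', 'scraping');
        if ($db->connect_error) {
            die('Connect failed: ' . $db->connect_error);
        }
        return $db;
    }

    $db = connect_db();

    // Before each unit of work, verify the link is still alive;
    // if the server closed it, open a fresh connection.
    if (!@$db->ping()) {
        $db = connect_db();
    }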

I have also tried a browser-based solution, reloading the script every time the while loop finishes. It works better, but the problem is the same: when I run the script 10 times at once, it starts to overload the MySQL server or the web server (I don't know which).

To resolve this, I rented a MySQL server where I can tune my.cnf... but the problem stays approximately the same.
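For reference, these are the my.cnf settings most often tied to this error; the values below are illustrative starting points, not settings from the original setup:

    [mysqld]
    max_allowed_packet = 64M    # "gone away" often means a packet exceeded this limit
    wait_timeout       = 600    # seconds before the server drops an idle connection
    max_connections    = 200    # must cover all simultaneous cron workers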

========= MY QUESTION IS: where is the problem coming from? The table size? Do I need a big 100 Mbps dedicated server? If so, are you sure it will resolve the problem, and how fast will it be? Bear in mind that I want the extraction speed to reach approximately 100 URLs per second (right now it is 1 URL per 15 seconds, incredibly slow...)

  • There is only one while loop in the script. It loads each page, extracts data with preg_match or the DOM, and inserts it into the MySQL database (see the sketch after this list).

  • I extract lots of data, which is why a table quickly grows to millions of rows... when I remove them it may go a bit faster, but the problem is always the same: it is impossible to run tasks massively in parallel in order to accelerate the process.

  • I don't think the problem comes from my script. In any case, even perfectly optimized, it will not go as fast as I want.

  • I tested the script without proxies for scraping, but the difference is very small... not significant.
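A minimal sketch of the loop described in the first bullet above. The `queue` and `results` tables, the column names, and the credentials are assumptions for illustration; the real script presumably differs:

    <?php
    // Sketch of the single while loop: pop a URL, fetch it, extract a value,
    // insert into MySQL. Hypothetical schema: queue(id, url, done), results(url, value).
    $db = new mysqli('localhost', 'user', 'pass', 'scraping');

    // Pop one unprocessed URL, or return null when the queue is empty.
    // NB: this SELECT-then-UPDATE is not safe with many concurrent workers
    // without locking, which is exactly where parallel cron tasks collide.
    function fetch_next_url(mysqli $db): ?string
    {
        $row = $db->query('SELECT id, url FROM queue WHERE done = 0 LIMIT 1')->fetch_assoc();
        if ($row === null) {
            return null;
        }
        $db->query('UPDATE queue SET done = 1 WHERE id = ' . (int) $row['id']);
        return $row['url'];
    }

    // One prepared statement, reused on every iteration, instead of
    // rebuilding the INSERT string for each row.
    $stmt = $db->prepare('INSERT INTO results (url, value) VALUES (?, ?)');

    while ($url = fetch_next_url($db)) {
        $html = @file_get_contents($url);   // stand-in for the curl fetch
        if ($html === false) {
            continue;                       // skip pages that failed to load
        }
        if (preg_match('/<title>(.*?)<\/title>/s', $html, $m)) {
            $stmt->bind_param('ss', $url, $m[1]);
            $stmt->execute();
        }
    }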

My conclusion is that I need a dedicated server, but I don't want to invest around $100 per month if I am not sure it will resolve the problem and let me run these massive numbers of cron tasks / MySQL calls without trouble.


2 Answers

  • douwuying4709 2018-02-24 20:59

    It's easy... never multi-thread against the same URL. Many different URLs may be fine, but try to respect a certain timeout between requests. You can do that with:

    $random = rand(15, 35); // delay in seconds
    sleep($random);
    
