I have a tricky problem. I am on basic shared hosting, and I have written a scraping script in PHP using cURL.
Multi-threading with cURL is not really multi-threading: even the best curl_multi scripts I have tried only speed up scraping by a factor of 1.5 to 2. I therefore concluded that I need to run a large number of cron tasks (around 50) per minute against my PHP script, which reads from and writes to a MySQL table, in order to offer fast web scraping to my customers.
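For reference, here is a minimal sketch of the kind of curl_multi fetch I mean. It is not my exact code: the function name fetch_all, the 15-second timeout and the option choices are only placeholders for illustration.

```php
<?php
// Hypothetical helper: fetch several URLs concurrently in ONE PHP process.
// Returns an array mapping each URL to its body (null or '' if nothing came back).
// Note: duplicate URLs share one array key, so pass a de-duplicated list.
function fetch_all(array $urls, int $timeoutSeconds = 15): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for network activity instead of busy-looping.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

The downloads overlap, but the parsing and the MySQL inserts afterwards still run one after another in the same process, which may be why I only see a 1.5 to 2 times gain.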
My problem is that I get a "MySQL server has gone away" error when many cron tasks run at the same time. If I reduce the number of cron tasks, the script keeps working, but it is still slow.
I have also tried a browser-based approach in which the page reloads the script each time the while loop finishes. It works better, but the problem remains: when I run 10 instances of the script at the same time, it starts to overload the MySQL server or the web server (I don't know which).
To work around this, I rented a MySQL server on which I can edit my.cnf, but the problem stays approximately the same.
My question is: where is the problem coming from? Is it the table size? Do I need a big dedicated server with a 100 Mbps connection? If so, are you sure it will solve the problem, and how fast will it be? I want the extraction speed to reach approximately 100 URLs per second. At the moment it manages 1 URL per 15 seconds, which is about 1,500 times slower.
There is only one while loop in the script. It loads each page, extracts data with preg_match or DOM, and inserts it into the MySQL database.
I extract a lot of data, which is why the table quickly grows to millions of rows. When I delete them, it may run slightly faster, but the problem is the same: I cannot run many tasks in parallel to speed up the process.
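To give an idea of the structure, here is a simplified sketch of that loop. It is not my real code: extract_page_data and run_batch are made-up names, the title and link extraction are just example fields, and the $fetch and $store callbacks stand in for my cURL request and my MySQL INSERT.

```php
<?php
// Hypothetical helper: pull the title (regex) and all link targets (DOM) out of one page.
function extract_page_data(string $html): array
{
    if ($html === '') {
        return ['title' => null, 'links' => []];
    }

    $title = null;
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        $title = trim(html_entity_decode($m[1]));
    }

    $links = [];
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return ['title' => $title, 'links' => $links];
}

// Hypothetical main loop: one page at a time, fetch, extract, store.
// $fetch(string $url) returns the HTML or false; $store(string $url, array $data) saves a row.
function run_batch(array $urls, callable $fetch, callable $store): int
{
    $done = 0;
    while (($url = array_shift($urls)) !== null) {
        $html = $fetch($url);
        if ($html === false || $html === '') {
            continue; // skip pages that could not be downloaded
        }
        $store($url, extract_page_data($html));
        $done++;
    }
    return $done;
}
```

Each iteration waits for the download, then the parsing, then the insert, so a single instance of the script never handles more than one page at a time.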
I don't think the problem comes from my script. In any case, even if it were perfectly optimized, it would not go as fast as I want.
I tested the script without proxies, but the difference was very small and not significant.
My conclusion is that I need a dedicated server, but I don't want to spend around $100 per month unless I am sure it will solve the problem and let me run this many cron tasks and queries against the MySQL database without trouble.