Let's say I have a MySQL database with a large number of entries in it; call it 10k rows for now.
I have a task that I want to perform on each row of the table. The task can take anything from under a second to a few seconds, but can be done in parallel on different rows. It involves reading the row, loading a URL via cURL or loading data from disk, updating some of the fields, and saving the updated data back to the row along with a timestamp for when it was processed.
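To make the shape of the task concrete, here's a rough sketch of the per-row work; the table name `items` and columns `id`, `url`, `result`, `processed_at` are made up for illustration, not my real schema:

```php
<?php
// Minimal sketch of the per-row task. Table/column names are hypothetical.
function processRow(PDO $pdo, array $row): bool
{
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $ok = ($body !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
    curl_close($ch);

    if (!$ok) {
        return false; // leave the row alone so it can be retried later
    }

    $stmt = $pdo->prepare(
        "UPDATE items SET result = :r, processed_at = NOW() WHERE id = :id"
    );
    return $stmt->execute([':r' => $body, ':id' => $row['id']]);
}
```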
My question is how should I best structure my execution of the task to achieve the following:
- Resilience to failures, like bad cURL responses and missing data. The system needs to be able to re-do a row after it fails.
- Avoiding duplicated effort. The system should never load the same row for processing in two separate processes.
- The data should be loaded and saved in efficient batches to minimise per-SQL-call overhead.
- The system should keep a minimal memory footprint.
- The system should require minimal human oversight to reach its end goal of every row being processed.
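To support the retry and no-duplicates requirements I've assumed the work table would need some bookkeeping columns, something like this (column names are just placeholders for the sake of discussion):

```php
<?php
// Hypothetical bookkeeping columns on the work table: who claimed a row,
// how many attempts it has had, and when it was finished.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->exec("
    ALTER TABLE items
        ADD COLUMN claimed_by   VARCHAR(64) NULL,
        ADD COLUMN attempts     INT NOT NULL DEFAULT 0,
        ADD COLUMN processed_at DATETIME NULL
");
```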
What I'm thinking is to have one process which reads the IDs of the set of rows that need processing. These can then be array_chunk()ed into manageable sections, which are passed to processes spawned on the shell with exec(). The hand-off could be done either through the database (mark rows 1-2000 for execution by process 1), on the command line, or by saving a CSV file.
The problem I see with that is that it'll leave some of the processes idle for a lot of the time. One process might finish its batch of 1000 only to find that another process has been a lot slower and still has 500 to go. The second process could easily speed things up by taking another 250 rows off the slower one.
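To illustrate the kind of rebalancing I mean: instead of fixed pre-assigned chunks, each worker could repeatedly claim a small batch atomically, so fast workers naturally soak up the remaining rows. This is only a sketch of the idea (hypothetical table/column names again), and I don't know if it's the standard way to do it:

```php
<?php
// Each worker claims small batches until nothing is left. MySQL applies
// UPDATE ... LIMIT atomically, so two workers can't claim the same rows.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$workerId = uniqid('w', true);

while (true) {
    $claim = $pdo->prepare(
        "UPDATE items SET claimed_by = :w
          WHERE claimed_by IS NULL AND processed_at IS NULL
          LIMIT 50"
    );
    $claim->execute([':w' => $workerId]);
    if ($claim->rowCount() === 0) {
        break; // queue drained
    }

    $rows = $pdo->prepare(
        "SELECT * FROM items WHERE claimed_by = :w AND processed_at IS NULL"
    );
    $rows->execute([':w' => $workerId]);
    foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
        // ... process $row: set processed_at on success, or clear
        //     claimed_by on failure so another pass can retry it ...
    }
}
```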
I'm thinking there's probably a standard architecture that I'm missing here which is applied to this sort of process, or am I barking up the wrong tree?
Please stick to technologies that would be available on a standard LAMP server; I'm not really interested in setting up Hadoop or rewriting my code in another language. Still, if there's another technology that would work on a LAMP server, go ahead and suggest it.