dongxiezhuo8852 2017-02-16 00:37

Processing data from a database in parallel in PHP

Let's say that I have a MySQL database with a large number of entries in it, around 10k rows for now.

I have a task that I want to perform on each row of the table. The task can take anywhere from under a second to a few seconds, and it can be run in parallel on different rows of the database. The task involves reading the row, loading a URL via cURL or loading data from disk, updating some of the fields, and saving the updated data back to the row along with a timestamp for when it was processed.

My question is how should I best structure my execution of the task to achieve the following:

  1. Resilience to failures, like bad CURL responses and missing data. The system needs to be able to re-do a row after it fails.
  2. Avoiding duplicating effort. The system shouldn't ever load the same row for processing twice in two separate processes.
  3. The data should be loaded and saved in efficient batches to minimise the per-SQL-call overhead.
  4. The system should be structured so that it keeps a minimal memory footprint.
  5. The system should require minimal human oversight to reach its end goal of all rows being parsed.

What I'm thinking is to have one process which reads the IDs of the set of rows that need processing. This can then be array_chunked into manageable sections which are passed to processes spawned on the shell with exec. The passing is done either through the database (mark rows 1-2000 for execution with process 1), by the command line, or by saving a CSV file.
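
A minimal sketch of the dispatch step described above, assuming the command-line option for passing IDs. The worker script name (worker.php), the --ids flag, and the chunk size are hypothetical placeholders, and the ID list is hard-coded here where a real dispatcher would read it from the table. The actual exec call is left commented out so the sketch only prints the commands it would run.

```php
<?php
// Sketch only: worker.php and the --ids flag are hypothetical names,
// not part of any existing script.

/**
 * Split row IDs into chunks and build one shell command per chunk.
 *
 * @param int[]  $ids          IDs of the rows that still need processing
 * @param int    $chunkSize    number of rows handed to each worker
 * @param string $workerScript path to the (hypothetical) worker script
 * @return string[] one command line per chunk
 */
function buildWorkerCommands(array $ids, int $chunkSize, string $workerScript): array
{
    $commands = [];
    foreach (array_chunk($ids, $chunkSize) as $chunk) {
        // Pass the chunk on the command line as a comma-separated list.
        $idList = implode(',', array_map('intval', $chunk));
        $commands[] = 'php ' . escapeshellarg($workerScript)
            . ' --ids=' . escapeshellarg($idList);
    }
    return $commands;
}

// Stand-in for the IDs that the dispatcher would select from the database.
$ids = range(1, 10);
$commands = buildWorkerCommands($ids, 4, 'worker.php');

foreach ($commands as $command) {
    echo $command, PHP_EOL;
    // To launch the worker in the background on a Linux shell, the output
    // has to be redirected so that exec() does not wait for it to finish:
    // exec($command . ' > /dev/null 2>&1 &');
}
```

With fixed chunks like this, each worker's share is decided up front, which is exactly what leads to the idle-worker concern in the next paragraph.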

The problem I see with that is that it'll leave some of the processes idle for a lot of the time. One process might finish its batch of 1000 only to find that another process has been a lot slower and still has 500 to go. The idle process could easily speed things up by taking 250 of those rows off the slower process.

I'm thinking there's probably a standard architecture that I'm missing here which is applied to this sort of process, or am I barking up the wrong tree?

Please stick to technologies that would be available on a standard LAMP server - I'm not really that interested in setting up Hadoop or rewriting my code in another language. Still, if there's another technology that would probably work on a LAMP server then go ahead and suggest it.
