Let's say I have a MySQL database with a large number of entries in it; call it 10k rows for now.
I have a task that I want to perform on each row of the table. The task can take anything from under a second to a few seconds, but can be done in parallel on different rows. It involves reading the row, loading a URL via cURL or loading data from disk, updating some of the fields, and saving the updated data back to the row along with a timestamp for when it was processed.
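To make the shape of the task concrete, here's a rough sketch of the per-row work; the table name `items` and columns `id`, `url`, `result`, `processed_at` are made up for illustration, not my real schema:

```php
<?php
// Minimal sketch of the per-row task. Table/column names are hypothetical.
function processRow(PDO $pdo, array $row): bool
{
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $ok = ($body !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
    curl_close($ch);

    if (!$ok) {
        return false; // leave the row alone so it can be retried later
    }

    $stmt = $pdo->prepare(
        "UPDATE items SET result = :r, processed_at = NOW() WHERE id = :id"
    );
    return $stmt->execute([':r' => $body, ':id' => $row['id']]);
}
```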
My question is how should I best structure my execution of the task to achieve the following:
- Resilience to failures, like bad cURL responses and missing data. The system needs to be able to re-do a row after it fails.
- Avoiding duplicated effort. The system should never load the same row for processing in two separate processes.
- The data should be loaded and saved in efficient batches to minimise per-SQL-call overhead.
- The system should keep a minimal memory footprint.
- The system should require minimal human oversight to reach its end goal of every row being processed.
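To support the retry and no-duplicates requirements I've assumed the work table would need some bookkeeping columns, something like this (column names are just placeholders for the sake of discussion):

```php
<?php
// Hypothetical bookkeeping columns on the work table: who claimed a row,
// how many attempts it has had, and when it was finished.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->exec("
    ALTER TABLE items
        ADD COLUMN claimed_by   VARCHAR(64) NULL,
        ADD COLUMN attempts     INT NOT NULL DEFAULT 0,
        ADD COLUMN processed_at DATETIME NULL
");
```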
What I'm thinking is to have one process which reads the IDs of the set of rows that need processing. These can then be array_chunk()ed into manageable sections, which are passed to processes spawned on the shell with exec(). The hand-off could be done either through the database (mark rows 1-2000 for execution by process 1), on the command line, or by saving a CSV file.
The problem I see with that is that it'll leave some of the processes idle for a lot of the time. One process might finish its batch of 1000 only to find that another process has been a lot slower and still has 500 to go. The second process could easily speed things up by taking another 250 rows off the slower one.
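To illustrate the kind of rebalancing I mean: instead of fixed pre-assigned chunks, each worker could repeatedly claim a small batch atomically, so fast workers naturally soak up the remaining rows. This is only a sketch of the idea (hypothetical table/column names again), and I don't know if it's the standard way to do it:

```php
<?php
// Each worker claims small batches until nothing is left. MySQL applies
// UPDATE ... LIMIT atomically, so two workers can't claim the same rows.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$workerId = uniqid('w', true);

while (true) {
    $claim = $pdo->prepare(
        "UPDATE items SET claimed_by = :w
          WHERE claimed_by IS NULL AND processed_at IS NULL
          LIMIT 50"
    );
    $claim->execute([':w' => $workerId]);
    if ($claim->rowCount() === 0) {
        break; // queue drained
    }

    $rows = $pdo->prepare(
        "SELECT * FROM items WHERE claimed_by = :w AND processed_at IS NULL"
    );
    $rows->execute([':w' => $workerId]);
    foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
        // ... process $row: set processed_at on success, or clear
        //     claimed_by on failure so another pass can retry it ...
    }
}
```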
I'm thinking there's probably a standard architecture that I'm missing here which is applied to this sort of process, or am I barking up the wrong tree?
Please stick to technologies that would be available on a standard LAMP server; I'm not really interested in setting up Hadoop or rewriting my code in another language. Still, if there's another technology that would work on a LAMP server, go ahead and suggest it.