I recently had to iterate over a big table (~40 million records) in PostgreSQL using Propel and ran into performance problems with both memory consumption and execution speed: my script had been running for 22(!) hours.
The task was to find records matching certain criteria (devices not active for the last 6 months) and archive them, i.e. move them to another table together with all related entities from other tables.
The primary table my script works on has several columns: id, device_id, application_id, last_init_date and a few others that are not relevant here. The table stores information about applications installed on devices and their last activity dates. There may be several records with the same device_id and different application_id values. Here is a sample from the table:
    id    | device_id | application_id |   last_init_date
----------+-----------+----------------+---------------------
        1 |         1 |              1 | 2013-09-24 17:09:01
        2 |         1 |              2 | 2013-09-19 20:36:23
        3 |         1 |              3 | 2014-02-11 00:00:00
        4 |         2 |              4 | 2013-09-29 20:12:54
        5 |         3 |              5 | 2013-08-31 19:41:05
So, a device is considered old enough to be archived if the maximum last_init_date for its device_id in this table is older than 6 months. Here is the query:
SELECT device_id
FROM device_applications
GROUP BY device_id
HAVING MAX(last_init_date) < '2014-06-16 08:00:00'
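Run against the sample rows above, this returns all three devices, since even the most recent record (2014-02-11 for device_id 1) is older than the cutoff:

 device_id
-----------
         1
         2
         3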
In Propel, the same query looks like this:
\DeviceApplicationsQuery::create()
->select('DeviceId')
->groupByDeviceId()
->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
->find();
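For each device_id returned, the script then has to move that device's rows, plus the related rows from other tables, into archive tables. For one batch of ids the core of that step boils down to something like the sketch below; the archive table name device_applications_archive and the raw-SQL-through-Propel's-connection approach (Propel 1.x's \Propel::getConnection()) are just for illustration here, and the real script also has to handle the related tables:

// Sketch of archiving one batch of device ids. The archive table name and
// the raw-SQL approach are assumptions for illustration only.
$con = \Propel::getConnection();
$con->beginTransaction();
try {
    $placeholders = implode(',', array_fill(0, count($deviceIds), '?'));

    // copy the rows into the archive table...
    $con->prepare(
        "INSERT INTO device_applications_archive
         SELECT * FROM device_applications WHERE device_id IN ($placeholders)"
    )->execute($deviceIds);

    // ...and remove them from the live table
    $con->prepare(
        "DELETE FROM device_applications WHERE device_id IN ($placeholders)"
    )->execute($deviceIds);

    $con->commit();
} catch (\Exception $e) {
    $con->rollBack();
    throw $e;
}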
The result set of the device_id query, as you understand, is too big to fit in memory, so I have to split it somehow into chunks.
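To give an idea of what "chunks" could mean here, one possible approach is keyset pagination on device_id. This is only a sketch, assuming the generated query class exposes orderByDeviceId() (Propel normally generates one such method per column):

$cutoff    = $date->format('Y-m-d H:i:s');
$chunkSize = 1000;
$lastId    = 0;

do {
    // fetch the next chunk of stale device ids, ordered by device_id,
    // starting right after the last id seen in the previous chunk
    $chunk = \DeviceApplicationsQuery::create()
        ->select('DeviceId')
        ->where('device_applications.DEVICE_ID > ?', $lastId)
        ->groupByDeviceId()
        ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $cutoff)
        ->orderByDeviceId()
        ->limit($chunkSize)
        ->find();

    foreach ($chunk as $deviceId) {
        // archive this device and its related entities here
        $lastId = $deviceId;
    }
} while (count($chunk) === $chunkSize);

Each iteration then only holds $chunkSize ids in memory, although the GROUP BY is recomputed for every chunk, which may still be slow on 40 million rows.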
The question is: what is the best strategy to choose in this situation to decrease memory consumption and to speed up the script? In my answer I'll show you what I've found so far.