doujuan9698 2014-12-16 09:58

Iterating over a big table in PostgreSQL with Propel

I recently had a task to iterate over a big table (~40 million records) in PostgreSQL using Propel and ran into performance issues with both the memory limit and execution speed. My script had been running for 22(!) hours.

The task was to retrieve records matching certain criteria (not active for the last 6 months) and archive them, i.e. move them, together with all related entities from other tables, to another table.

The primary table my script works on has several columns: id, device_id, application_id, last_init_date, and a few others that have no significance here. This table contains information about applications installed on devices and their last activity dates. There may be several records with the same device_id and different application_ids. Here is a sample from the table:

     id    | device_id | application_id |   last_init_date
 ----------+-----------+----------------+---------------------
         1 |         1 |              1 | 2013-09-24 17:09:01
         2 |         1 |              2 | 2013-09-19 20:36:23
         3 |         1 |              3 | 2014-02-11 00:00:00
         4 |         2 |              4 | 2013-09-29 20:12:54
         5 |         3 |              5 | 2013-08-31 19:41:05

So a device is considered old enough to be archived if the maximum last_init_date for its device_id in this table is older than 6 months. Here is the query:

SELECT device_id
FROM device_applications
GROUP BY device_id
HAVING MAX(last_init_date) < '2014-06-16 08:00:00'

In Propel it looks like:

\DeviceApplicationsQuery::create()
  ->select('DeviceId')
  ->groupByDeviceId()
  ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
  ->find();

The resulting set, as you can guess, is too big to fit in memory, so I have to split it into chunks somehow.

The question is: what is the best strategy to choose in this situation to decrease memory consumption and speed up the script? In my answer I'll show what I've found so far.


2 answers

  • dongpo2340 2014-12-16 09:58

    I know three strategies for traversing a big table.

    1. Good old limit/offset

    The problem with this approach is that the database actually examines the records that you want to skip with OFFSET. Here is a quote from the PostgreSQL documentation:

    The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.

    Here is a simple example (not my initial query):

    explain (analyze)
    SELECT *
    FROM device_applications
    ORDER BY device_id
    LIMIT 100
    OFFSET 300;
    

    Execution plan:

    Limit  (cost=37.93..50.57 rows=100 width=264) (actual time=0.630..0.835 rows=100 loops=1)
        ->  Index Scan using device_applications_device_id_application_id_unique on device_applications  (cost=0.00..5315569.97 rows=42043256 width=264) (actual time=0.036..0.806 rows=400 loops=1)
    Total runtime: 0.873 ms
    

    Pay special attention to the actual values in the Index Scan node: PostgreSQL worked through 400 records, which is the offset (300) plus the limit (100). So this approach is quite inefficient, especially considering the complexity of the initial query.
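    For comparison, chunked iteration with limit/offset in Propel might look roughly like this (a sketch; archiveDevice() is a hypothetical helper standing in for the actual archiving work):

    $chunkSize = 1000;
    $offset = 0;

    do {
        // On each iteration the server still has to compute all the skipped rows
        $deviceIds = \DeviceApplicationsQuery::create()
          ->select('DeviceId')
          ->groupByDeviceId()
          ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
          ->orderByDeviceId()
          ->offset($offset)
          ->limit($chunkSize)
          ->find();

        foreach ($deviceIds as $deviceId) {
            archiveDevice($deviceId); // hypothetical archiving helper
        }

        $offset += $chunkSize;
    } while (count($deviceIds) > 0);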

    2. Ranging by some column

    We can avoid the limitation of the limit/offset approach by making the query work on ranges of the table, produced by slicing the table along some column.

    To clarify, let's imagine you have a table with 100 records; you can divide it into five ranges of 20 records each: 0 - 20, 20 - 40, 40 - 60, 60 - 80, 80 - 100, and then work with the smaller subsets. In my case the column we can range by is device_id. The query looks like this:

    SELECT device_id
    FROM device_applications
    WHERE device_id >= 1 AND device_id < 1000
    GROUP BY device_id
    HAVING MAX(last_init_date) < '2014-06-16 08:00:00';
    

    It takes a range of device_ids, groups the records, and applies the condition on last_init_date. Of course, many ranges may (and in most cases will) contain no records matching the condition at all. So the problem with this approach is that you have to scan the whole table, even if the records you want to find amount to just 5% of all records.
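    A sketch of that range loop in Propel (it assumes device_id values are reasonably dense integers; archiveDevice() is again a hypothetical helper):

    $rangeSize = 1000;

    // The maximum device_id bounds the loop
    $maxDeviceId = \DeviceApplicationsQuery::create()
      ->select('DeviceId')
      ->orderByDeviceId(\Criteria::DESC)
      ->findOne();

    for ($start = 1; $start <= $maxDeviceId; $start += $rangeSize) {
        // Only this slice of device_ids is grouped and filtered
        $deviceIds = \DeviceApplicationsQuery::create()
          ->select('DeviceId')
          ->filterByDeviceId(array('min' => $start, 'max' => $start + $rangeSize - 1))
          ->groupByDeviceId()
          ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
          ->find();

        foreach ($deviceIds as $deviceId) {
            archiveDevice($deviceId); // hypothetical archiving helper
        }
    }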

    3. Using cursors

    What we need is a cursor. Cursors allow you to iterate over a result set without fetching all the data at once. In PHP you make use of a cursor when you iterate over a PDOStatement. A simple example:

    $stmt = $dbh->prepare("SELECT * FROM table");
    $stmt->execute();
    
    // Iterate over statement using a cursor
    foreach ($stmt as $row) {
        // Do something
    }
    

    In Propel you can take advantage of this PDO feature via the PropelOnDemandFormatter class. So, the final code:

    $devApps = \DeviceApplicationsQuery::create()
      ->setFormatter('\PropelOnDemandFormatter')
      ->select('DeviceId')
      ->groupByDeviceId()
      ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
      ->find();
    
    /** @var \DeviceApplications $devApp */
    foreach ($devApps as $devApp) {
        // Do something
    }
    

    Here the call to find() does not fetch all the data; instead it creates a collection that hydrates objects on demand, one row at a time, which keeps memory consumption flat.
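    To tie this back to the original task, here is a minimal sketch of the hypothetical archiveDevice() helper used in the snippets above. It assumes an ArchivedDeviceApplications model with columns matching device_applications, which is not part of the original schema:

    function archiveDevice($deviceId)
    {
        $con = \Propel::getConnection();
        $con->beginTransaction();
        try {
            $rows = \DeviceApplicationsQuery::create()
              ->filterByDeviceId($deviceId)
              ->find($con);

            foreach ($rows as $row) {
                // Copy the record into the archive table...
                $archived = new \ArchivedDeviceApplications();
                $archived->fromArray($row->toArray());
                $archived->save($con);
                // ...then remove the original
                $row->delete($con);
            }
            $con->commit();
        } catch (\Exception $e) {
            $con->rollBack();
            throw $e;
        }
    }

    Wrapping each device in its own small transaction keeps the lock footprint modest while the script runs for hours.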

