dpjj4763 2015-05-31 13:45
Views: 403

Elasticsearch PHP bulk index performance vs. index

I ran a benchmark on Elasticsearch using elasticsearch-php. I compared the time taken to index 10,000 documents one by one against the time taken to index 10,000 documents in bulk requests of 1,000 documents each.

On my VPN server (3 cores, 2 GB of memory), the performance is about the same with or without bulk indexing.

My PHP code (inspired by a post):

<?php
set_time_limit(0);  //  no timeout
require 'vendor/autoload.php';
$es = new Elasticsearch\Client([
    'hosts'=>['127.0.0.1:9200']
]);
$max = 10000;

// ELASTICSEARCH BULK INDEX
$temps_debut = microtime(true);
for ($i = 0; $i <=  $max; $i++) {
    $params['body'][] = array(
        'index' => array(
            '_index' => 'articles',
            '_type' => 'article',
            '_id' => 'cle' . $i
        )
    );
    $params['body'][] = array(
        'my_field' => 'my_value' . $i
    );
    if ($i % 1000) {   // Every 1000 documents stop and send the bulk request
        $responses = $es->bulk($params);
        $params = array();  // erase the old bulk request    
        unset($responses); // unset  to save memory
    }
}
$temps_fin = microtime(true);
echo 'Elasticsearch bulk: ' . round($i / round($temps_fin - $temps_debut, 4)) . ' per sec <br>';

// ELASTICSEARCH WITHOUT BULK INDEX
$temps_debut = microtime(true);
for ($i = 1; $i <= $max; $i++) {
    $params = array();
    $params['index'] = 'my_index';
    $params['type']  = 'my_type';
    $params['id']    = "key".$i;
    $params['body']  = array('testField' => 'valeur'.$i);
    $ret = $es->index($params);
}
$temps_fin = microtime(true);
echo 'Elasticsearch One by one : ' . round($i / round($temps_fin - $temps_debut, 4)) . 'per sec <br>';
?>

Elasticsearch bulk: 1209 per sec
Elasticsearch One by one : 1197per sec

Is there something wrong with my bulk indexing code that prevents it from performing better?

Thanks


1 answer

  • dsadsadsa1231 2015-05-31 21:21

    Replace:

    if ($i % 1000) {   // Every 1000 documents stop and send the bulk request
    

    with:

    if (($i + 1) % 1000 === 0) {   // Every 1000 documents stop and send the bulk request
    

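    Otherwise the condition is true for every value of $i that is not a multiple of 1000, so you send a bulk request on 999 out of every 1000 iterations. Note that the corrected condition only covers every document if $max is a multiple of 1000; otherwise the final partial batch is never sent.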
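To make the difference between the two conditions concrete, here is a minimal sketch that only counts how often each one would trigger a bulk request. It does not talk to Elasticsearch; `countFlushes()` is a hypothetical helper introduced for illustration, and the counter stands in for the `$es->bulk($params)` call.

```php
<?php
// Count how many bulk requests a given flush condition would trigger.
// countFlushes() is a hypothetical helper; no Elasticsearch client is used.
function countFlushes(int $max, callable $shouldFlush): int
{
    $flushes = 0;
    for ($i = 0; $i < $max; $i++) {
        if ($shouldFlush($i)) {
            $flushes++; // stands in for $es->bulk($params)
        }
    }
    return $flushes;
}

// Original condition: truthy for every $i that is NOT a multiple of 1000.
$original = countFlushes(10000, function (int $i): bool {
    return (bool) ($i % 1000);
});

// Corrected condition: true only on the 1000th, 2000th, ... document.
$fixed = countFlushes(10000, function (int $i): bool {
    return ($i + 1) % 1000 === 0;
});

echo "original condition: $original requests\n"; // 9990
echo "fixed condition: $fixed requests\n";       // 10
```

With the original condition, nearly every document goes out in its own request, which would explain why the "bulk" figure is so close to the one-by-one figure.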

    Also, correct this bug:

    for ($i = 0; $i <=  $max; $i++) {
    

    This iterates over $max + 1 items. Replace it with:

    for ($i = 0; $i < $max; $i++) {
    

    There might also be a problem with how you initialize $params. Shouldn't you set it up outside of the loop and only clear $params['body'] after each ->bulk() call? When you reset it with $params = array(); you lose everything else stored in it.
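The points above could be combined roughly as follows. This is a sketch under stated assumptions, not a drop-in replacement: `indexInBatches()` and the `$sendBulk` callback are hypothetical names, with the callback standing in for `$es->bulk($params)` so the batching logic can be run without a cluster. The final flush for a partial batch is an addition that covers the case where `$max` is not a multiple of the batch size.

```php
<?php
// Batching logic only. $sendBulk is a hypothetical stand-in for a call such
// as $es->bulk($params).
function indexInBatches(int $max, int $batchSize, callable $sendBulk): void
{
    $params = ['body' => []]; // set up once, outside the loop

    for ($i = 0; $i < $max; $i++) { // exactly $max iterations
        $params['body'][] = ['index' => [
            '_index' => 'articles',
            '_type'  => 'article',
            '_id'    => 'cle' . $i,
        ]];
        $params['body'][] = ['my_field' => 'my_value' . $i];

        if (($i + 1) % $batchSize === 0) {
            $sendBulk($params);
            $params['body'] = []; // clear only the body, keep any other keys
        }
    }

    if (!empty($params['body'])) {
        $sendBulk($params); // send the remaining partial batch, if any
    }
}

// Usage: record the number of documents in each request instead of sending it.
$batchSizes = [];
indexInBatches(2500, 1000, function (array $params) use (&$batchSizes): void {
    $batchSizes[] = intdiv(count($params['body']), 2); // two entries per document
});
echo implode(', ', $batchSizes), "\n"; // 1000, 1000, 500
```

In real use, the callback would wrap the client call, for example `function (array $params) use ($es) { $es->bulk($params); }`.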

    Also, remember that Elasticsearch may be distributed over a cluster. Bulk operations can then be spread across the nodes to even out the workload, so some of the performance gain from bulk indexing will not be visible on a single physical node.
