ds000001 2015-11-02 19:04
浏览 48
已采纳

PHP中的对象比较和数组排序

I have a problem with object comparison in PHP. What seems like a straightforward code actually runs way too slow for my liking and as I am not that advanced in the language I would like some feedback and suggestions regarding the following code:

class TestTokenGroup {
    private $tokens;
    ...

    public static function create($tokens) {
        $instance = new static();
        $instance->tokens = $tokens;
        ...
        return $instance;
    }

    public function getTokens() {
        return $this->tokens;
    }

    public static function compare($tokenGroup1, $tokenGroup2) {
        $i = 0;
        $minLength = min(array(count($tokenGroup1->getTokens()), count($tokenGroup2->getTokens())));
        $equalLengths = (count($tokenGroup1->getTokens()) == count($tokenGroup2->getTokens()));
        $comparison = strcmp($tokenGroup1->getTokens()[$i], $tokenGroup2->getTokens()[$i]);
        while ($comparison == 0) {
            $i++;
            if (($i == $minLength) && ($equalLengths == true)) {
                return 0;
            }
            $comparison = strcmp($tokenGroup1->getTokens()[$i], $tokenGroup2->getTokens()[$i]);
        }
        $result = $comparison;
        if ($result < 0)
            return -1;
        elseif ($result > 0)
            return 1;
        else
            return 0;
    }
    ...

}

In the code above $tokens is just a simple array of strings.

Using the method above through usort() for an array of TestTokenGroup consisting of around 40k objects takes ~2secs.

Is there a sensible way to speed that up? Where is the bottleneck here?

EDIT: Added the getTokens() method I initially forgot to include.

  • 写回答

1条回答 默认 最新

  • dongwen7730 2015-11-02 19:17
    关注

    You know that objects are "pass by reference", and arrays are "pass by value"?

    If getTokens() returns $this->tokens, the array is copied every time you invoke that method.

    Try accessing $tokens directly via $tokenGroup1->tokens. You could also use references (&) although returning a reference doesn't work in all PHP versions.

    Alternatively, make one copy only:

    $tokens1 = $tokenGroup1->getTokens();
    $tokens2 = $tokenGroup2->getTokens();
    

    Even if each token group is relatively small, it will save at least 40000 * ( 6 + $average_token_group_length * 2) array copies.

    UPDATE

    I've benchmarked OP's code (removing the ... lines) using:

    function gentokens() {
            $ret = [];
            for ( $i=0; $i< 3; $i++)
            {
                    $str = "";
                    for ( $x = rand(0,3); $x < 10; $x ++ )
                            $str .= chr( rand(0,25) + ord('a') );
                    $ret[] = $str;
            }
            return $ret;
    }
    
    
    $start = microtime(true);
    
    $array = [];    // this will hold the TestTokenGroup instances
    $dummy = "";    // this will hold the tokens, space-separated and newline-separated
    $dummy2= [];    // this will hold the space-concatenated strings
    
    for ( $i=0; $i < 40000; $i++)
    {
            $array[] = TestTokenGroup::create( $t = gentokens() );
    
            $dummy   .= implode(' ', $t ) . "
    ";
            $dummy2[] = implode(' ', $t );
    }
    
    // write a test file to benchmark GNU sort:
    file_put_contents("sort-data.txt", $dummy);
    
    $inited = microtime(true);
    printf("init: %f s
    ", ($inited-$start));
    
    usort( $array, [ 'TestTokenGroup', 'compare'] );
    
    $sorted = microtime(true);
    printf("sort: %f s
    ", ($sorted-$inited));
    
    usort( $dummy2, 'strcmp' );
    
    $sorted2 = microtime(true);
    printf("sort: %f s
    ", ($sorted2-$sorted));
    

    With the following results:

    init: 0.359329 s    // for generating 40000 * 3 random strings and setup
    sort: 1.012096 s    // for the TestTokenGroup::compare
    sort: 0.120583 s    // for the 'strcmp' compare
    

    And, running time sort sort-data.txt > /dev/null yields

    .052 u  (user-time, in seconds).
    

    optimisation 1: remove array copies

    replacing ->getTokens() with ->tokens yields (I'll only list the TestTokenGroup::compare results):

    sort: 0.832794 s
    

    Optimisation 2: remove redundant array() in min

    Changing the $minlength line to:

    $minLength = min(count($tokenGroup1->tokens), count($tokenGroup2->tokens));
    

    gives

    sort: 0.779134 s
    

    Optimisation 3: Only call count once for each tokenGroup

        $count1 = count($tokenGroup1->tokens);
        $count2 = count($tokenGroup2->tokens);
        $minLength = min($count1, $count2);
        $equalLengths = ($count1 == $count2);
    

    gives

    sort: 0.679649 s
    

    Alternative approach

    The fastest sort so far is strcmp( $stringarray, 'strcmp' ): 0.12s - still twice as slow as GNU sort, but the latter only does one thing, and does it well.

    So, to sort the TokenGroups efficiently we need to construct sort key consisting of a simple string. We can use \0 as a delimiter for the tokens, and we don't have to worry about them being equal length, because as soon as one character is different, the compare aborts.

    Here's the implementation:

    $arr2 = [];
    foreach ( $array as $o )
      $arr2[ implode("\0", $o->getTokens() ) ] = $o;
    
    $init2 = microtime(true);
    printf("init2: %f s
    ", ($init2-$sorted2));
    
    uksort( $arr2, 'strcmp' );
    
    $sorted3 = microtime(true);
    printf("sort: %f s
    ", ($sorted3-$init2));
    

    and here the results:

    init2: 0.125939 s
    sort: 0.104717 s
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 求MCSCANX 帮助
  • ¥15 机器学习训练相关模型
  • ¥15 Todesk 远程写代码 anaconda jupyter python3
  • ¥15 我的R语言提示去除连锁不平衡时clump_data报错,图片以下所示,卡了好几天了,苦恼不知道如何解决,有人帮我看看怎么解决吗?
  • ¥15 在获取boss直聘的聊天的时候只能获取到前40条聊天数据
  • ¥20 关于URL获取的参数,无法执行二选一查询
  • ¥15 液位控制,当液位超过高限时常开触点59闭合,直到液位低于低限时,断开
  • ¥15 marlin编译错误,如何解决?
  • ¥15 VUE项目怎么运行,系统打不开
  • ¥50 pointpillars等目标检测算法怎么融合注意力机制