dongshou6788 2015-11-09 16:40
浏览 26
已采纳

我的评分算法问题?

What I'm looking for is the clubs with the highest ratings average based on number of votes and obviously the highest ratings.

What I have done is the following:

  • calculate the variance of the entire population (all votes && all clubs)
  • determine the standard deviation of each club against the population variance
  • calculate a new club weighted median based on subtracting one standard deviation to weed clubs with low votes

The problem is I cannot determine why my data is not showing correctly. I imagine there is an issue with my calcuations. I am getting numbers in the 10s and being negative when I should be receiving a value from 0-5 (as is the rating scores).

I'm not really quite sure where my logic fails.

Here is my ratings code logic:

    $sql="SELECT SUM(rating) AS sumrating,COUNT(reviews.rating) AS countrating FROM reviews";   
    $rs=mysqli_fetch_array(mysqli_query($scx_dbh,$sql));    

    // get the total summation of ratings against all reviews
    $ratingssum=(int)$rs['sumrating'];

    // get the total number of ratings against all reviews
    $ratingscount=(int)$rs['countrating'];  

    // get the population mediam
    $mean = $ratingssum / $ratingscount;    

    // determine the variance of the population
    $variance = 0;
    $sql="SELECT rating AS score FROM reviews";
    $rs=mysqli_query($scx_dbh,$sql);
    while($row=mysqli_fetch_array($rs)){        
        $score = (int)$row['score'];        
        $variance += pow(($score-$mean),2);                     
    }
    $variance = $variance/$ratingscount;

    // loop through all clubs and implement new rating
    $scores=array();
    $sql="SELECT locid,COUNT(reviewid) AS locationrecordcount,AVG(rating) AS locationmedian FROM reviews GROUP BY locid";
    $rs=mysqli_query($scx_dbh,$sql);

    /// begin loop
    while($row=mysqli_fetch_array($rs)){

        // get the number of review votes for this club
        $numvotes=(int)$row['locationrecordcount'];

        // get the location id 
        $locId = (int)$row['locid'];

        // find the standard deviation for this club (total variance * numclubvotes)
        $standarddev=sqrt($variance*$numvotes);

        // create the new rating for this club with 1 standard deviation less
        $oldRating=$row['locationmedian'];          
        $newRating=$oldRating-$standarddev;
        $scores[$locId] = array(            
          'numvotes'=>$numvotes,
          'standard-deviation'=>$standarddev,
          'original-rating'=> $oldRating,
          'weighted-rating'=>$newRating
       );
    }

    usort($scores,function($a,$b){
        return $a['weighted-rating']-$b['weighted-rating'];
    });

And here are my results:

top 10

 [0] => Array
        (
            [numvotes] => 1121
            [standard-deviation] => 68.898321138853
            [original-rating] => 4.415700267618207
            [weighted-rating] => -64.482620871235
        )

    [1] => Array
        (
            [numvotes] => 909
            [standard-deviation] => 62.042283630954
            [original-rating] => 3.1290979097910174
            [weighted-rating] => -58.913185721163
        )

    [2] => Array
        (
            [numvotes] => 594
            [standard-deviation] => 50.153247058093
            [original-rating] => 4.414225589225589
            [weighted-rating] => -45.739021468868
        )

    [3] => Array
        (
            [numvotes] => 505
            [standard-deviation] => 46.243587892712
            [original-rating] => 4.090099009900985
            [weighted-rating] => -42.153488882811
        )

    [4] => Array
        (
            [numvotes] => 517
            [standard-deviation] => 46.78979093937
            [original-rating] => 4.661025145067699
            [weighted-rating] => -42.128765794302
        )

    [5] => Array
        (
            [numvotes] => 505
            [standard-deviation] => 46.243587892712
            [original-rating] => 3.2117821782178173
            [weighted-rating] => -43.031805714494
        )

    [6] => Array
        (
            [numvotes] => 398
            [standard-deviation] => 41.053233483774
            [original-rating] => 4.231155778894469
            [weighted-rating] => -36.822077704879
        )

    [7] => Array
        (
            [numvotes] => 340
            [standard-deviation] => 37.944190471069
            [original-rating] => 3.9102941176470547
            [weighted-rating] => -34.033896353422
        )

    [8] => Array
        (
            [numvotes] => 323
            [standard-deviation] => 36.983422110177
            [original-rating] => 3.261145510835913
            [weighted-rating] => -33.722276599341
        )

    [9] => Array
        (
            [numvotes] => 280
            [standard-deviation] => 34.433791770728
            [original-rating] => 3.36767857142857
            [weighted-rating] => -31.066113199299
        )

    [10] => Array
        (
            [numvotes] => 254
            [standard-deviation] => 32.796136967109
            [original-rating] => 3.1411417322834665
            [weighted-rating] => -29.654995234825
        )

Worst 10

[232] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 4.95
        [weighted-rating] => 2.0398134378534
    )

[233] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 5
        [weighted-rating] => 2.0898134378534
    )

[234] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 4
        [weighted-rating] => 1.9421873473882
    )

[235] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 4.8
        [weighted-rating] => 1.8898134378534
    )

[236] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 3.25
        [weighted-rating] => 1.1921873473882
    )

[237] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 5
        [weighted-rating] => 2.9421873473882
    )

[238] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 5
        [weighted-rating] => 2.9421873473882
    )

[239] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 4.1
        [weighted-rating] => 2.0421873473882
    )

[240] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 5
        [weighted-rating] => 2.9421873473882
    )

[241] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 5
        [weighted-rating] => 2.0898134378534
    )

)

UPDATE

OK so I recalculated the standard deviation against the entire population. It is 2.0578126526118.

Here is my current code:

    $sql="SELECT SUM(reviews.rating) AS sumrating,COUNT(reviews.rating) AS countrating FROM reviews";   
    $rs=mysqli_fetch_array(mysqli_query($scx_dbh,$sql));    
    $ratingssum=(int)$rs['sumrating'];
    $ratingscount=(int)$rs['countrating'];  
    $mean = $ratingssum / $ratingscount;    
    $variance = 0;
    $sql="SELECT rating AS score FROM reviews";
    $rs=mysqli_query($scx_dbh,$sql);
    while($row=mysqli_fetch_array($rs)){        
        $score = (int)$row['score'];        
        $variance += pow(($score-$mean),2);                     
    }
    $variance = $variance/$ratingscount;
    $standarddev=sqrt($variance);
    $scores=array();
    $sql="SELECT locid,COUNT(reviewid) AS locationrecordcount,AVG(rating) AS locationmedian FROM reviews GROUP BY locid";
    $rs=mysqli_query($scx_dbh,$sql);
    while($row=mysqli_fetch_array($rs)){
        $numvotes=(int)$row['locationrecordcount'];
        $locId = (int)$row['locid'];        
        $oldRating=$row['locationmedian'];
        $newRating=$oldRating-$standarddev;
        $scores[$locId] = array(            
            'numvotes'=>$numvotes,
            'standard-deviation'=>$standarddev,
            'original-rating'=> $oldRating,
            'weighted-rating'=>$newRating
        );
    }   
    usort($scores,function($a,$b){
        return (int)($a['weighted-rating']-$b['weighted-rating']);
    });

1./ I think my sorting function is incorrect. After sorting using my sorting function, these are the top 5:

     [0] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.2
                [weighted-rating] => -1.8578126526118
            )

        [1] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.05
                [weighted-rating] => -2.0078126526118
            )

        [2] => Array
            (
                [numvotes] => 4
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.7625
                [weighted-rating] => -1.2953126526118
            )

        [3] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.1
                [weighted-rating] => -1.9578126526118
            )

        [4] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.4
                [weighted-rating] => -1.6578126526118
            )

As you can see, besides the fact that they are producing negative numbers, it looks like the weighted-average of position 1 (index 0) is -1.85 and position 2 (index 1) is -2.00. I imagine there's an issue with either my algorithm or my sorting function in my code or else why are there negative numbers being sorted as first.

Also, i am getting clubs at positions 1 when they have 1 vote. The purpose of this algorithm was to weed these clubs out so I can focus on the clubs that have 1000's of votes.

Here are the bottom 5:

       [237] => Array
            (
                [numvotes] => 29
                [standard-deviation] => 2.0578126526118
                [original-rating] => 4.112068965517241
                [weighted-rating] => 2.0542563129054
            )

        [238] => Array
            (
                [numvotes] => 5
                [standard-deviation] => 2.0578126526118
                [original-rating] => 3.8800000000000003
                [weighted-rating] => 1.8221873473882
            )

        [239] => Array
            (
                [numvotes] => 31
                [standard-deviation] => 2.0578126526118
                [original-rating] => 3.7499999999999996
                [weighted-rating] => 1.6921873473882
            )

        [240] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 5
                [weighted-rating] => 2.9421873473882
            )

        [241] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 4.45
                [weighted-rating] => 2.3921873473882
            )

The same behavior exhibits on the bottom 5. I've got a weighted-average of 2.39 for position 5 (index 241) and a weighted-average of 2.94 for position 4 (index 240)

  • 写回答

2条回答 默认 最新

  • doujia4759 2015-11-09 16:56
    关注

    The standard deviation is calculated by the square root of the variation, not the square root of the variation multiplied by population (numvotes):

    // find the standard deviation for this club (total variance)
    $standarddev=sqrt($variance);
    

    If you want to weigh each club by themselves then you need to calculate the variation (and standard deviation) for each club. To do this you need to sum only the votes for each club, not all votes, and then calculate the variation and standard deviation. Then the variation and standard deviation for all votes seems unnecessary.

    Update:

    What you are trying to accomplish (weed out clubs with few votes) cannot be done using standard deviation (σ).

    Concider the following:

    • A club has 1 vote: [5]. σ is then 0. 5/1=5, (5-5)^2 / 1=0, sqrt(0)=0
    • A club has 1 vote: [1]. σ is again 0. 1/1=1, (1-1)^2 / 1=0, sqrt(0)=0
    • A club has 2 votes: [5, 5]. σ is yet again 0. 10/2=5, ((5-5)^2 + (5-5)^2) / 2=0, sqrt(0)=0

    Now you'd think that you can just weed out the clubs with a low σ.

    • A club has 2 votes: [1, 5]. σ is now 2.83. 6/2=3, ((1-3)^2 + (5-3)^2) / 2=8, sqrt(8)=2.83

    As you can see there is nothing here that says "hey, this club got lots of votes". The only thing σ says is how much spread on the votes there is. If there is no or a small spread (variation) then σ will be 0 or small and vice versa.

    What you could try is to look at the difference between a clubs σ (Cσ) and the total σ (Tσ). If that value is near 0 (to a limit, lets say 0.1) then you'd know that there is a similar variation in that club as in the whole population. But this will still not guarantee that there are atleast x amount of votes. This calculation would be something like abs(Cσ - Tσ) < 0.1.

    Regarding your sorting function:

    usort expects that the returned integer is -1, 0 or 1 to function properly. When you start to subtract negative numbers you'll get rather strange results. The correct sorting function should look something like this:

    usort($scores, function cmp($a, $b)
    {
      if ($a['weighted-rating'] == $b['weighted-rating']) {
        return 0;
      }
      return ($a['weighted-rating'] < $b['weighted-rating']) ? -1 : 1;
    }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 HFSS 中的 H 场图与 MATLAB 中绘制的 B1 场 部分对应不上
  • ¥15 如何在scanpy上做差异基因和通路富集?
  • ¥20 关于#硬件工程#的问题,请各位专家解答!
  • ¥15 关于#matlab#的问题:期望的系统闭环传递函数为G(s)=wn^2/s^2+2¢wn+wn^2阻尼系数¢=0.707,使系统具有较小的超调量
  • ¥15 FLUENT如何实现在堆积颗粒的上表面加载高斯热源
  • ¥30 截图中的mathematics程序转换成matlab
  • ¥15 动力学代码报错,维度不匹配
  • ¥15 Power query添加列问题
  • ¥50 Kubernetes&Fission&Eleasticsearch
  • ¥15 報錯:Person is not mapped,如何解決?