dongshou6788 2015-11-09 16:40
浏览 26
已采纳

我的评分算法问题?

What I'm looking for is the clubs with the highest ratings average based on number of votes and obviously the highest ratings.

What I have done is the following:

  • calculate the variance of the entire population (all votes && all clubs)
  • determine the standard deviation of each club against the population variance
  • calculate a new club weighted median based on subtracting one standard deviation to weed clubs with low votes

The problem is I cannot determine why my data is not showing correctly. I imagine there is an issue with my calcuations. I am getting numbers in the 10s and being negative when I should be receiving a value from 0-5 (as is the rating scores).

I'm not really quite sure where my logic fails.

Here is my ratings code logic:

    $sql="SELECT SUM(rating) AS sumrating,COUNT(reviews.rating) AS countrating FROM reviews";   
    $rs=mysqli_fetch_array(mysqli_query($scx_dbh,$sql));    

    // get the total summation of ratings against all reviews
    $ratingssum=(int)$rs['sumrating'];

    // get the total number of ratings against all reviews
    $ratingscount=(int)$rs['countrating'];  

    // get the population mediam
    $mean = $ratingssum / $ratingscount;    

    // determine the variance of the population
    $variance = 0;
    $sql="SELECT rating AS score FROM reviews";
    $rs=mysqli_query($scx_dbh,$sql);
    while($row=mysqli_fetch_array($rs)){        
        $score = (int)$row['score'];        
        $variance += pow(($score-$mean),2);                     
    }
    $variance = $variance/$ratingscount;

    // loop through all clubs and implement new rating
    $scores=array();
    $sql="SELECT locid,COUNT(reviewid) AS locationrecordcount,AVG(rating) AS locationmedian FROM reviews GROUP BY locid";
    $rs=mysqli_query($scx_dbh,$sql);

    /// begin loop
    while($row=mysqli_fetch_array($rs)){

        // get the number of review votes for this club
        $numvotes=(int)$row['locationrecordcount'];

        // get the location id 
        $locId = (int)$row['locid'];

        // find the standard deviation for this club (total variance * numclubvotes)
        $standarddev=sqrt($variance*$numvotes);

        // create the new rating for this club with 1 standard deviation less
        $oldRating=$row['locationmedian'];          
        $newRating=$oldRating-$standarddev;
        $scores[$locId] = array(            
          'numvotes'=>$numvotes,
          'standard-deviation'=>$standarddev,
          'original-rating'=> $oldRating,
          'weighted-rating'=>$newRating
       );
    }

    usort($scores,function($a,$b){
        return $a['weighted-rating']-$b['weighted-rating'];
    });

And here are my results:

top 10

 [0] => Array
        (
            [numvotes] => 1121
            [standard-deviation] => 68.898321138853
            [original-rating] => 4.415700267618207
            [weighted-rating] => -64.482620871235
        )

    [1] => Array
        (
            [numvotes] => 909
            [standard-deviation] => 62.042283630954
            [original-rating] => 3.1290979097910174
            [weighted-rating] => -58.913185721163
        )

    [2] => Array
        (
            [numvotes] => 594
            [standard-deviation] => 50.153247058093
            [original-rating] => 4.414225589225589
            [weighted-rating] => -45.739021468868
        )

    [3] => Array
        (
            [numvotes] => 505
            [standard-deviation] => 46.243587892712
            [original-rating] => 4.090099009900985
            [weighted-rating] => -42.153488882811
        )

    [4] => Array
        (
            [numvotes] => 517
            [standard-deviation] => 46.78979093937
            [original-rating] => 4.661025145067699
            [weighted-rating] => -42.128765794302
        )

    [5] => Array
        (
            [numvotes] => 505
            [standard-deviation] => 46.243587892712
            [original-rating] => 3.2117821782178173
            [weighted-rating] => -43.031805714494
        )

    [6] => Array
        (
            [numvotes] => 398
            [standard-deviation] => 41.053233483774
            [original-rating] => 4.231155778894469
            [weighted-rating] => -36.822077704879
        )

    [7] => Array
        (
            [numvotes] => 340
            [standard-deviation] => 37.944190471069
            [original-rating] => 3.9102941176470547
            [weighted-rating] => -34.033896353422
        )

    [8] => Array
        (
            [numvotes] => 323
            [standard-deviation] => 36.983422110177
            [original-rating] => 3.261145510835913
            [weighted-rating] => -33.722276599341
        )

    [9] => Array
        (
            [numvotes] => 280
            [standard-deviation] => 34.433791770728
            [original-rating] => 3.36767857142857
            [weighted-rating] => -31.066113199299
        )

    [10] => Array
        (
            [numvotes] => 254
            [standard-deviation] => 32.796136967109
            [original-rating] => 3.1411417322834665
            [weighted-rating] => -29.654995234825
        )

Worst 10

[232] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 4.95
        [weighted-rating] => 2.0398134378534
    )

[233] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 5
        [weighted-rating] => 2.0898134378534
    )

[234] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 4
        [weighted-rating] => 1.9421873473882
    )

[235] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 4.8
        [weighted-rating] => 1.8898134378534
    )

[236] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 3.25
        [weighted-rating] => 1.1921873473882
    )

[237] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 5
        [weighted-rating] => 2.9421873473882
    )

[238] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 5
        [weighted-rating] => 2.9421873473882
    )

[239] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 4.1
        [weighted-rating] => 2.0421873473882
    )

[240] => Array
    (
        [numvotes] => 1
        [standard-deviation] => 2.0578126526118
        [original-rating] => 5
        [weighted-rating] => 2.9421873473882
    )

[241] => Array
    (
        [numvotes] => 2
        [standard-deviation] => 2.9101865621466
        [original-rating] => 5
        [weighted-rating] => 2.0898134378534
    )

)

UPDATE

OK so I recalculated the standard deviation against the entire population. It is 2.0578126526118.

Here is my current code:

    $sql="SELECT SUM(reviews.rating) AS sumrating,COUNT(reviews.rating) AS countrating FROM reviews";   
    $rs=mysqli_fetch_array(mysqli_query($scx_dbh,$sql));    
    $ratingssum=(int)$rs['sumrating'];
    $ratingscount=(int)$rs['countrating'];  
    $mean = $ratingssum / $ratingscount;    
    $variance = 0;
    $sql="SELECT rating AS score FROM reviews";
    $rs=mysqli_query($scx_dbh,$sql);
    while($row=mysqli_fetch_array($rs)){        
        $score = (int)$row['score'];        
        $variance += pow(($score-$mean),2);                     
    }
    $variance = $variance/$ratingscount;
    $standarddev=sqrt($variance);
    $scores=array();
    $sql="SELECT locid,COUNT(reviewid) AS locationrecordcount,AVG(rating) AS locationmedian FROM reviews GROUP BY locid";
    $rs=mysqli_query($scx_dbh,$sql);
    while($row=mysqli_fetch_array($rs)){
        $numvotes=(int)$row['locationrecordcount'];
        $locId = (int)$row['locid'];        
        $oldRating=$row['locationmedian'];
        $newRating=$oldRating-$standarddev;
        $scores[$locId] = array(            
            'numvotes'=>$numvotes,
            'standard-deviation'=>$standarddev,
            'original-rating'=> $oldRating,
            'weighted-rating'=>$newRating
        );
    }   
    usort($scores,function($a,$b){
        return (int)($a['weighted-rating']-$b['weighted-rating']);
    });

1./ I think my sorting function is incorrect. After sorting using my sorting function, these are the top 5:

     [0] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.2
                [weighted-rating] => -1.8578126526118
            )

        [1] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.05
                [weighted-rating] => -2.0078126526118
            )

        [2] => Array
            (
                [numvotes] => 4
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.7625
                [weighted-rating] => -1.2953126526118
            )

        [3] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.1
                [weighted-rating] => -1.9578126526118
            )

        [4] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 0.4
                [weighted-rating] => -1.6578126526118
            )

As you can see, besides the fact that they are producing negative numbers, it looks like the weighted-average of position 1 (index 0) is -1.85 and position 2 (index 1) is -2.00. I imagine there's an issue with either my algorithm or my sorting function in my code or else why are there negative numbers being sorted as first.

Also, i am getting clubs at positions 1 when they have 1 vote. The purpose of this algorithm was to weed these clubs out so I can focus on the clubs that have 1000's of votes.

Here are the bottom 5:

       [237] => Array
            (
                [numvotes] => 29
                [standard-deviation] => 2.0578126526118
                [original-rating] => 4.112068965517241
                [weighted-rating] => 2.0542563129054
            )

        [238] => Array
            (
                [numvotes] => 5
                [standard-deviation] => 2.0578126526118
                [original-rating] => 3.8800000000000003
                [weighted-rating] => 1.8221873473882
            )

        [239] => Array
            (
                [numvotes] => 31
                [standard-deviation] => 2.0578126526118
                [original-rating] => 3.7499999999999996
                [weighted-rating] => 1.6921873473882
            )

        [240] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 5
                [weighted-rating] => 2.9421873473882
            )

        [241] => Array
            (
                [numvotes] => 1
                [standard-deviation] => 2.0578126526118
                [original-rating] => 4.45
                [weighted-rating] => 2.3921873473882
            )

The same behavior exhibits on the bottom 5. I've got a weighted-average of 2.39 for position 5 (index 241) and a weighted-average of 2.94 for position 4 (index 240)

  • 写回答

2条回答 默认 最新

  • doujia4759 2015-11-09 16:56
    关注

    The standard deviation is calculated by the square root of the variation, not the square root of the variation multiplied by population (numvotes):

    // find the standard deviation for this club (total variance)
    $standarddev=sqrt($variance);
    

    If you want to weigh each club by themselves then you need to calculate the variation (and standard deviation) for each club. To do this you need to sum only the votes for each club, not all votes, and then calculate the variation and standard deviation. Then the variation and standard deviation for all votes seems unnecessary.

    Update:

    What you are trying to accomplish (weed out clubs with few votes) cannot be done using standard deviation (σ).

    Concider the following:

    • A club has 1 vote: [5]. σ is then 0. 5/1=5, (5-5)^2 / 1=0, sqrt(0)=0
    • A club has 1 vote: [1]. σ is again 0. 1/1=1, (1-1)^2 / 1=0, sqrt(0)=0
    • A club has 2 votes: [5, 5]. σ is yet again 0. 10/2=5, ((5-5)^2 + (5-5)^2) / 2=0, sqrt(0)=0

    Now you'd think that you can just weed out the clubs with a low σ.

    • A club has 2 votes: [1, 5]. σ is now 2.83. 6/2=3, ((1-3)^2 + (5-3)^2) / 2=8, sqrt(8)=2.83

    As you can see there is nothing here that says "hey, this club got lots of votes". The only thing σ says is how much spread on the votes there is. If there is no or a small spread (variation) then σ will be 0 or small and vice versa.

    What you could try is to look at the difference between a clubs σ (Cσ) and the total σ (Tσ). If that value is near 0 (to a limit, lets say 0.1) then you'd know that there is a similar variation in that club as in the whole population. But this will still not guarantee that there are atleast x amount of votes. This calculation would be something like abs(Cσ - Tσ) < 0.1.

    Regarding your sorting function:

    usort expects that the returned integer is -1, 0 or 1 to function properly. When you start to subtract negative numbers you'll get rather strange results. The correct sorting function should look something like this:

    usort($scores, function cmp($a, $b)
    {
      if ($a['weighted-rating'] == $b['weighted-rating']) {
        return 0;
      }
      return ($a['weighted-rating'] < $b['weighted-rating']) ? -1 : 1;
    }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 装 pytorch 的时候出了好多问题,遇到这种情况怎么处理?
  • ¥20 IOS游览器某宝手机网页版自动立即购买JavaScript脚本
  • ¥15 手机接入宽带网线,如何释放宽带全部速度
  • ¥30 关于#r语言#的问题:如何对R语言中mfgarch包中构建的garch-midas模型进行样本内长期波动率预测和样本外长期波动率预测
  • ¥15 ETLCloud 处理json多层级问题
  • ¥15 matlab中使用gurobi时报错
  • ¥15 这个主板怎么能扩出一两个sata口
  • ¥15 不是,这到底错哪儿了😭
  • ¥15 2020长安杯与连接网探
  • ¥15 关于#matlab#的问题:在模糊控制器中选出线路信息,在simulink中根据线路信息生成速度时间目标曲线(初速度为20m/s,15秒后减为0的速度时间图像)我想问线路信息是什么