笑故挽风 2009-07-23 23:09 采纳率: 100%
浏览 241
已采纳

如何有效地计算一个正在运行的标准差?

I have an array of lists of numbers, e.g.:

[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
     ...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)

What I would like to do is efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n.

To do the standard deviation, I loop through again, now that I have the mean calculated.

I would like to avoid going through the array twice, once for the mean and then once for the SD (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g. Perl or Python) or pseudocode is fine.

转载于:https://stackoverflow.com/questions/1174984/how-to-efficiently-calculate-a-running-standard-deviation

  • 写回答

14条回答 默认 最新

  • 笑故挽风 2009-08-28 18:24
    关注

    The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

    It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.

    You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).

    I wrote two blog entries on the topic which go into more details, including how to delete previous values online:

    You can also take a look at my Java implement; the javadoc, source, and unit tests are all online:

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
  • elliott.david 2009-07-23 23:14
    关注

    You could look at the Wikipedia article on Standard Deviation, in particular the section about Rapid calculation methods.

    There's also an article I found that uses Python, you should be able to use the code in it without much change: Subliminal Messages - Running Standard Deviations.

    评论
  • 胖鸭 2009-07-23 23:21
    关注

    Statistics::Descriptive is a very decent Perl module for these types of calculations:

    #!/usr/bin/perl
    
    use strict; use warnings;
    
    use Statistics::Descriptive qw( :all );
    
    my $data = [
        [ 0.01, 0.01, 0.02, 0.04, 0.03 ],
        [ 0.00, 0.02, 0.02, 0.03, 0.02 ],
        [ 0.01, 0.02, 0.02, 0.03, 0.02 ],
        [ 0.01, 0.00, 0.01, 0.05, 0.03 ],
    ];
    
    my $stat = Statistics::Descriptive::Full->new;
    # You also have the option of using sparse data structures
    
    for my $ref ( @$data ) {
        $stat->add_data( @$ref );
        printf "Running mean: %f\n", $stat->mean;
        printf "Running stdev: %f\n", $stat->standard_deviation;
    }
    __END__
    

    Output:

    C:\Temp> g
    Running mean: 0.022000
    Running stdev: 0.013038
    Running mean: 0.020000
    Running stdev: 0.011547
    Running mean: 0.020000
    Running stdev: 0.010000
    Running mean: 0.020000
    Running stdev: 0.012566
    
    评论
  • 胖鸭 2009-07-23 23:31
    关注

    I think this issue will help you. Standard deviation

    评论
  • 7*4 2009-07-23 23:35
    关注

    How big is your array? Unless it is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.

    My preference would be to use the numpy array maths extension to convert your array of arrays into a numpy 2D array and get the standard deviation directly:

    >>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
    >>> import numpy
    >>> a = numpy.array(x)
    >>> a.std(axis=0) 
    array([ 1. ,  1. ,  0.5,  1.5,  1.5,  1.5])
    >>> a.mean(axis=0)
    array([ 2. ,  3. ,  4.5,  4.5,  5.5,  6.5])
    

    If that's not an option and you need a pure Python solution, keep reading...

    If your array is

    x = [ 
          [ 1, 2, 4, 3, 4, 5 ],
          [ 3, 4, 5, 6, 7, 8 ],
          ....
    ]
    

    Then the standard deviation is:

    d = len(x[0])
    n = len(x)
    sum_x = [ sum(v[i] for v in x) for i in range(d) ]
    sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
    std_dev = [ sqrt((sx2 - sx**2)/N)  for sx, sx2 in zip(sum_x, sum_x2) ]
    

    If you are determined to loop through your array only once, the running sums can be combined.

    sum_x  = [ 0 ] * d
    sum_x2 = [ 0 ] * d
    for v in x:
       for i, t in enumerate(v):
       sum_x[i] += t
       sum_x2[i] += t**2
    

    This isn't nearly as elegant as the list comprehension solution above.

    评论
  • 程序go 2009-07-23 23:39
    关注

    The basic answer is to accumulate the sum of both x (call it 'sum_x1') and x2 (call it 'sum_x2') as you go. The value of the standard deviation is then:

    stdev = sqrt((sum_x2 / n) - (mean * mean)) 
    

    where

    mean = sum_x / n
    

    This is the sample standard deviation; you get the population standard deviation using 'n' instead of 'n - 1' as the divisor.

    You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.

    评论
  • hurriedly% 2009-07-24 02:32
    关注

    Perhaps not what you were asking, but ... If you use a numpy array, it will do the work for you, efficiently:

    from numpy import array
    
    nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
                  (0.00, 0.02, 0.02, 0.03, 0.02),
                  (0.01, 0.02, 0.02, 0.03, 0.02),
                  (0.01, 0.00, 0.01, 0.05, 0.03)))
    
    print nums.std(axis=1)
    # [ 0.0116619   0.00979796  0.00632456  0.01788854]
    
    print nums.mean(axis=1)
    # [ 0.022  0.018  0.02   0.02 ]
    

    By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:

    评论
  • 撒拉嘿哟木头 2009-07-24 17:34
    关注

    Have a look at PDL (pronounced "piddle!").

    This is the Perl Data Language which is designed for high precision mathematics and scientific computing.

    Here is an example using your figures....

    use strict;
    use warnings;
    use PDL;
    
    my $figs = pdl [
        [0.01, 0.01, 0.02, 0.04, 0.03],
        [0.00, 0.02, 0.02, 0.03, 0.02],
        [0.01, 0.02, 0.02, 0.03, 0.02],
        [0.01, 0.00, 0.01, 0.05, 0.03],
    ];
    
    my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );
    
    say "Mean scores:     ", $mean;
    say "Std dev? (adev): ", $adev;
    say "Std dev? (prms): ", $prms;
    say "Std dev? (rms):  ", $rms;
    


    Which produces:

    Mean scores:     [0.022 0.018 0.02 0.02]
    Std dev? (adev): [0.0104 0.0072 0.004 0.016]
    Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
    Std dev? (rms):  [0.011661904 0.009797959 0.0063245553 0.017888544]
    


    Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".

    However it maybe PRMS (which Sinan's Statistics::Descriptive example show) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)

    For more PDL information have a look at:

    评论
  • MAO-EYE 2013-04-27 01:34
    关注

    Here's a "one-liner", spread over multiple lines, in functional programming style:

    def variance(data, opt=0):
        return (lambda (m2, i, _): m2 / (opt + i - 1))(
            reduce(
                lambda (m2, i, avg), x:
                (
                    m2 + (x - avg) ** 2 * i / (i + 1),
                    i + 1,
                    avg + (x - avg) / (i + 1)
                ),
                data,
                (0, 0, 0)))
    
    评论
  • 乱世@小熊 2013-07-14 07:10
    关注

    Here is a literal pure Python translation of the Welford's algorithm implementation from http://www.johndcook.com/standard_deviation.html:

    https://github.com/liyanage/python-modules/blob/master/running_stats.py

    class RunningStats:
    
        def __init__(self):
            self.n = 0
            self.old_m = 0
            self.new_m = 0
            self.old_s = 0
            self.new_s = 0
    
        def clear(self):
            self.n = 0
    
        def push(self, x):
            self.n += 1
    
            if self.n == 1:
                self.old_m = self.new_m = x
                self.old_s = 0
            else:
                self.new_m = self.old_m + (x - self.old_m) / self.n
                self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)
    
                self.old_m = self.new_m
                self.old_s = self.new_s
    
        def mean(self):
            return self.new_m if self.n else 0.0
    
        def variance(self):
            return self.new_s / (self.n - 1) if self.n > 1 else 0.0
    
        def standard_deviation(self):
            return math.sqrt(self.variance())
    

    Usage:

    rs = RunningStats()
    rs.push(17.0);
    rs.push(19.0);
    rs.push(24.0);
    
    mean = rs.mean();
    variance = rs.variance();
    stdev = rs.standard_deviation();
    
    评论
  • Lotus@ 2013-12-30 01:46
    关注

    The Python runstats Module is for just this sort of thing. Install runstats from PyPI:

    pip install runstats
    

    Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.

    from runstats import Statistics
    
    stats = [Statistics() for num in range(len(data[0]))]
    
    for row in data:
    
        for index, val in enumerate(row):
            stats[index].push(val)
    
        for index, stat in enumerate(stats):
            print 'Index', index, 'mean:', stat.mean()
            print 'Index', index, 'standard deviation:', stat.stddev()
    

    Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass as described in the Art of Computer Programming, Vol 2, p. 232, 3rd edition. The benefit of this is numerically stable and accurate results.

    Disclaimer: I am the author the Python runstats module.

    评论
  • perhaps? 2014-11-03 14:38
    关注
    n=int(raw_input("Enter no. of terms:"))
    
    L=[]
    
    for i in range (1,n+1):
    
        x=float(raw_input("Enter term:"))
    
        L.append(x)
    
    sum=0
    
    for i in range(n):
    
        sum=sum+L[i]
    
    avg=sum/n
    
    sumdev=0
    
    for j in range(n):
    
        sumdev=sumdev+(L[j]-avg)**2
    
    dev=(sumdev/n)**0.5
    
    print "Standard deviation is", dev
    
    评论
  • 笑故挽风 2016-12-22 09:51
    关注

    As the following answer describes: Does pandas/scipy/numpy provide a cumulative standard deviation function? The Python Pandas module contains a method to calculate the running or cumulative standard deviation. For that you'll have to convert your data into a pandas dataframe (or a series if it is 1D), but there are functions for that.

    评论
  • 喵-见缝插针 2018-06-06 21:06
    关注

    I like to express the update this way:

    def running_update(x, N, mu, var):
        '''
            @arg x: the current data sample
            @arg N : the number of previous samples
            @arg mu: the mean of the previous samples
            @arg var : the variance over the previous samples
            @retval (N+1, mu', var') -- updated mean, variance and count
        '''
        N = N + 1
        rho = 1.0/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        return (N, mu, var)
    

    so that a one-pass function would look like this:

    def one_pass(data):
        N = 0
        mu = 0.0
        var = 0.0
        for x in data:
            N = N + 1
            rho = 1.0/N
            d = x - mu
            mu += rho*d
            var += rho*((1-rho)*d**2 - var)
            # could yield here if you want partial results
       return (N, mu, var)
    

    note that this is calculating the sample variance (1/N), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalzation factor). Unlike the other answers, the variable, var, that is tracking the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" in getting the variance).

    In a class it would look like this:

    class RunningMeanVar(object):
        def __init__(self):
            self.N = 0
            self.mu = 0.0
            self.var = 0.0
        def push(self, x):
            self.N = self.N + 1
            rho = 1.0/N
            d = x-self.mu
            self.mu += rho*d
            self.var += + rho*((1-rho)*d**2-self.var)
        # reset, accessors etc. can be setup as you see fit
    

    This also works for weighted samples:

    def running_update(w, x, N, mu, var):
        '''
            @arg w: the weight of the current sample
            @arg x: the current data sample
            @arg mu: the mean of the previous N sample
            @arg var : the variance over the previous N samples
            @arg N : the number of previous samples
            @retval (N+w, mu', var') -- updated mean, variance and count
        '''
        N = N + w
        rho = w/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        return (N, mu, var)
    
    评论
查看更多回答(13条)

报告相同问题?

悬赏问题

  • ¥15 Pycharm无法自动补全,识别第三方库函数接收的参数!
  • ¥15 STM32U575 pwm和DMA输出的波形少一段
  • ¥30 android百度地图SDK海量点显示标题
  • ¥15 windows导入environment.yml运行conda env create -f environment_win.yml命令报错
  • ¥15 这段代码可以正常运行,打包后无法执行,在执行for内容之前一直不断弹窗,请修改调整
  • ¥15 C语言判断有向图是否存在环路
  • ¥15 请问4.11到4.18以及4.27和4.29公式的具体推导过程是怎样的呢
  • ¥20 将resnet50中的卷积替换微ODConv动态卷积
  • ¥15 通过文本框输入商品信息点击按钮将商品信息列举出来点击加入购物车商品信息添加到表单中
  • ¥100 这是什么压缩算法?如何解压?