weixin_39866646
2020-12-31 05:16

mrmath/mrstats more precise variance calculation

I reimplemented the variance kernel using Welford's online algorithm to fix catastrophic cancellation:
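
For reference, here is a minimal Python sketch of the update step (not the actual C++ kernel in the PR, just an illustration of the idea):

    import math

    def welford_variance(samples, ddof=1):
        # Single pass: keep a running mean and the running sum of squared
        # deviations from that mean (M2). The variance is never obtained by
        # subtracting two large, nearly equal accumulators, which is what
        # causes the catastrophic cancellation in the naive formula.
        n, mean, m2 = 0, 0.0, 0.0
        for x in samples:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)   # note: uses the freshly updated mean
        return m2 / (n - ddof) if n > ddof else math.nan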

precision test with random noise:


mrconvert  image.mif -coord 3 0 - | mrcalc - 0 -mult rand 1000 -mult -add large.mif -force
mrstats large.mif 
      volume       mean     median      stdev        min        max      count
       [ 0 ]    499.789 500.431824    288.816 0.00315811    999.994     902629

old


mrcat large.mif large.mif large.mif -axis 3 - | mrmath - var -axis 3 - | mrstats -
      volume       mean     median      stdev        min        max      count
       [ 0 ] -3.72367e-06 2.20043717e-09  0.0157959 -0.0468736  0.0468749     902629

new

  volume       mean     median      stdev        min        max      count
   [ 0 ]          0          0          0          0          0     902629
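
To see why identical volumes can yield a non-zero variance at all, here is a small sketch (plain numpy, not the old MRtrix3 code) of how a naive single-pass formula like mean(x^2) - mean(x)^2 breaks down in float32 when the values are large but nearly constant along the reduction axis:

    import numpy as np

    rng = np.random.default_rng(0)
    x = (500.0 + 1e-3 * rng.standard_normal(1_000_000)).astype(np.float32)

    naive = np.mean(x * x) - np.mean(x) ** 2   # two terms of ~2.5e5 nearly cancel
    exact = np.var(x.astype(np.float64))       # reference value, about 1e-6

    print(naive, exact)   # the float32 "naive" result is dominated by rounding error

Welford's algorithm avoids subtracting two large accumulators, so the error stays at the level of the data itself.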

test against numpy with b1000 90 dir image, float32:


mrstats 1k.mif 
      volume       mean     median      stdev        min        max      count
       [ 0 ]    346.971          0    641.885   -134.515    4366.91     902629
       [ 1 ]    332.922          0    615.249   -122.857    4396.99     902629
       [ 2 ]    349.577          0    644.673    -138.88    4989.46     902629
       [ 3 ]    339.598          0    628.063   -129.644    4064.88     902629
       [ 4 ]    345.108          0    636.355   -126.902    4334.92     902629
       [ 5 ]    347.219          0    641.879   -123.496    4797.42     902629
       [ 6 ]    341.361          0    630.904   -136.148    4907.52     902629
...
      [ 88 ]    344.547          0    636.797    -121.31    4348.49     902629
      [ 89 ]    345.772          0    637.696   -129.747    4646.53     902629

numpy:

numpy.var(IMAGE1K.astype(np.float64), axis=3, ddof=1) --> var_1k_np64b.mif

old


mrmath 1k.mif var -axis 3 - | mrcalc var_1k_np64b.mif - -sub - | mrstats -
      volume       mean     median      stdev        min        max      count
       [ 0 ] -1.01356e-05          0 0.00351362    -0.0625  0.0703125     902629

new


      volume       mean     median      stdev        min        max      count
       [ 0 ]          0          0          0          0          0     902629

18 identical b0s

Difference between the variance of 18 identical b0s as computed by numpy.var and by the old mrmath var (min, max intensity: -561.946, 14416): [see attached image]

The difference between numpy and the proposed changes is zero.

This question originates from the open-source project: MRtrix3/mrtrix3


14 replies

  • weixin_39918128 4 months ago

    Would it be beneficial to do the same thing in mrstats?

  • weixin_39866646 4 months ago

    Would it be beneficial to do the same thing in mrstats?

    Yes, mrstats suffers from the same problem (although there it only manifests if a single volume contains large but similar values, which is arguably less likely than similar values across volumes).

    I've implemented it but there are a number of issues to think about:

    • mrstats uses the population variance, mrmath uses the sample variance, do we want to unify those?
    • mrstats outputs the variance for single-voxel images, I changed that to min 2 voxels.
    • mrstats works with complex numbers, mrmath silently casts to default_type
    • for complex numbers the mrstats std calculation is incorrect unless we are talking about pseudo-std (see doi:10.1007/978-981-10-2920-2_31). The variance is always non-negative and real, and should be the sum of the variances of the real and imaginary parts; the pseudo-variance is complex(var(real), var(imag)). I have implemented the pseudo-std for backwards compatibility, but I'd suggest either making this clear in the description or changing it to the actual definition. Which shall it be? (A small numerical sketch of the two definitions follows this list.)
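
    As a small numerical sketch of the two definitions (plain numpy, not the mrstats implementation):

        import numpy as np

        rng = np.random.default_rng(1)
        z = rng.standard_normal(10_000) + 1j * 2.0 * rng.standard_normal(10_000)

        # real-valued variance: E[|z - mean(z)|^2] = var(real) + var(imag),
        # always real and non-negative (this is also what numpy.var returns
        # for complex input)
        var_rv = np.var(z.real) + np.var(z.imag)   # roughly 1 + 4 = 5 here
        std_rv = np.sqrt(var_rv)                   # the proposed real-valued std
        assert np.isclose(np.var(z), var_rv)

        # pseudo-variance, kept for backwards compatibility:
        # complex(var(real), var(imag)), with the pseudo-std taken component-wise
        pseudo_std = np.std(z.real) + 1j * np.std(z.imag)

        print(var_rv, std_rv, pseudo_std)
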
  • weixin_39918128 4 months ago

    mrstats uses the population variance, mrmath uses the sample variance, do we want to unify those?

    :man_shrugging:

    mrstats outputs the variance for single-voxel images, I changed that to min 2 voxels.

    :+1:

    mrstats works with complex numbers, mrmath silently casts to default_type

    This is probably the case with many commands: devs operating on complex data have updated those commands that they have required at the time to achieve such, but it's not been done across the board systematically. It also means that many commands are operating natively using complex data even if input / output are real, which is 99.99% of the time. I'd really like to go through and template a lot of these commands for real/complex data type at some point.

    for complex numbers the mrstats std calculation is incorrect

    Deferring to , but I'd think that defaulting to the real sum of variances and toggling to obtain pseudo-variance using a command-line option would make the most sense.

  • weixin_39870413 4 months ago

    Definitely happy to use the sample variance in mrstats also. Haven't bothered till now since the number of voxels involved would typically be so large that it shouldn't matter. But there's no reason not to. And it would start to matter with small masks.
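
    (For a sense of scale, a quick numpy check of how much the choice matters at the two extremes of mask size:)

        import numpy as np

        tiny = np.array([1.0, 2.0, 4.0])               # a 3-voxel "mask"
        print(np.var(tiny), np.var(tiny, ddof=1))      # 1.56 vs 2.33: a factor of n/(n-1) = 1.5

        big = np.random.default_rng(2).random(902629)  # mask the size of the test image above
        print(np.var(big) / np.var(big, ddof=1))       # (n-1)/n, i.e. ~0.999999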

    And yes, it would be nice if all our tools dealt with complex numbers appropriately where it makes sense...

    However, I'm not sure what to do with regard to the output of mrstats for the variance of complex numbers. While defining the variance of complex numbers as the sum of the variances of the real and imaginary parts might be mathematically more sound, it doesn't 'feel' like the right answer... I would have thought the mean of the two would be better. I like the existing approach since that preserves the most information, and it can trivially be used to figure out the sum or mean of variances if desired. But I'd be happy enough with additional options to specify which to produce.

    In terms of how that's implemented, we could either:

    • add additional choices to the -output option (the simplest and cleanest IMO), leaving the current tabular output unmodified (but appropriately documented)

    • add options to modify the tabular output based on the version of the variance desired. Not convinced this is necessarily the most versatile solution though.

    I'm also not convinced I like the idea of providing the actual sum of variances, since that's not what I would have expected - I'd naively expect it to be the mean of the variances, and I'd be surprised if I was alone in that. If we stick with the current output, at least it's clear what it is. If it were a single number, I certainly would have interpreted it wrongly... The danger with providing a single real number for the variance is that, without taking a good look at the docs, many users would simply misinterpret it... If we stick with the current output and require users to supply different options to get the real-valued version, at least we can expect they would have read the relevant docs...?

  • weixin_39639965 4 months ago

    Just my 2 cents, feel free to ignore.

    I'm a bit surprised here: isn't this relatively commonly known, at least "among people who work frequently with complex numbers" or something like that? I get that, if you've not seen that definition before, you probably need a moment to think about it and why it could make sense. But in some ways, you'd rather quickly move away from the reflex of thinking it'd be a mean of variances, if that was an initial reflex / expectation to begin with of course (please, let it be clear again I don't mean to offend; just in case).

    It's supposed to be a generalisation of the real number case, right? So basically, whatever the definition would be, you'd expect it to end up being the variance of the real part if the imaginary part is zero. But using the mean, you'd end up with half of that if the imaginary part is zero. So, just to say that I would personally not be so sure users would generally expect it to be the mean of the variances when getting only a single number output. It's subjective of course, and I can see the initial idea of the mean popping up in people's minds briefly, but you'd think it'd "auto-correct" relatively quickly thereafter, if that makes sense.

    In general, this goes for definitions of a lot of stuff, especially when a bit more "exotic": people will have to be aware or inform themselves before using it. Who would compute the variance of complex valued numbers if they don't know what it is or what to expect from it... no?

    Again, just my 2 cents.

  • weixin_39866646 4 months ago

    I went ahead with the first proposal: keep std complex, and add the real-valued std to the options. On second thought I think std_rv is a slightly better name than std_real, to prevent confusion with std(real(Z)), but feel free to throw your cents, hats or stones!

    Since complex numbers cannot be ordered, I also specified how we do that. (Oddly enough, Matlab uses the real part for sorting and for the comparison operators, while the min and max functions use the magnitude, so min(array) > max(array) can be true in Matlab land.) I left the median undefined for complex numbers, but changed its output from nan to N/A to be in line with std on fewer than 2 values.

    Here is the doc:

    
    USAGE
    
         mrstats [ options ] image
    
            image        the input image from which statistics will be computed.
    
    
    Statistics options
    
      -output field  (multiple uses permitted)
         output only the field specified. Multiple such options can be supplied if
         required. Choices are: mean, median, std, std_rv, min, max, count. Useful
         for use in scripts. Both std options refer to the unbiased (sample)
         standard deviation. For complex data, min, max and std are calculated
         separately for real and imaginary parts, std_rv is based on the real
         valued variance (equals sqrt of sum of variances of imaginary and real
         parts).
    
  • weixin_39639965 4 months ago

    but feel free to throw your cents, hats or stones!

    :joy: Probably all good; people will read this in any case if they have to rely on it, I reckon. If that promotes peace :dove: , then all my cents are in your piggy bank. :heart:

    I think std_rv is a slightly better name than std_real to prevent confusion with std(real(Z))

    Yeah for sure; "std_real" would confuse otherwise. It's not easy though, since the standard that users are likely to expect would be what they're used to from common mathematics packages: e.g. Matlab or, for Python converts, probably NumPy. They're all pretty consistent in referring to this as simply the std, and the "rv" variance as simply var. That's also the first thing that pops up, even Wikipedia-style, if you do the most straightforward Google search. So I reckon, if the greater good requires deviating from the common standard, you're probably better off making it obscure on purpose, so people can't find it until they're forced to read the documentation in detail. std_rv is never going to pop up in people's minds naturally, unless they see it in a list of things in the documentation and read what it actually means. That'll at least avoid mistakes (which std_real might certainly cause; i.e., as you've pointed out).

  • weixin_39870413 4 months ago

    Sounds good to me too. Must admit I'd never thought about what the variance should mean for complex numbers, so happy to be educated - but like I suggested, I'd be surprised if I was alone in needing educating. It's properly documented though, so no issues. :+1:

    On a separate note, this echoes a conversation I had just yesterday with about what the noise variance map produced by dwidenoise should look like given a complex input... Do we scale by √2 so that the noise level is comparable to the magnitude version of the same data (which runs the risk of underestimation for complex inputs with zero imaginary components), or do we use the 'official' version (which would suggest higher noise / lower SNR than would be obtained using the magnitude version of the same data)...?

  • weixin_39870413 4 months ago

    BTW, it looks like you're going to need to update the test data following these changes...

  • weixin_39918128 4 months ago

    but feel free to throw your cents, hats or stones!

    laughs in trebuchet

    Only other suggestion would be whether to move some of the help text regarding -output choices to DESCRIPTION.

  • weixin_39866646 4 months ago

    Ok, modified the test data to match the change from population to sample std and added a test for complex data. I first committed 2 commits for the new and amended data to master, then created a new branch based off the test data head in dev and cherry-picked and rebased these into one commit. Hope that makes sense 🤷‍♂

    what the noise variance map produced by dwidenoise should look like given a complex input.

    From the command description, the noise map does not seem to be defined. Does it refer to the mean noise level 'sigma' as defined in Veraart 2016? Maybe worth clarifying.

    If "noise map" refers to variance (sigma squared?), I'd say it should be unchanged for real data and that data converted to complex data with zero imaginary part - in line with the mathematical definition of variance (and std_rv). The magnitude of said data is equal to its norm so unless I am miss something SNR should be unchanged?

  • weixin_39789857 4 months ago

    what the noise variance map produced by dwidenoise should look like given a complex input.

    From the command description, the noise map does not seem to be defined. Does it refer to the mean noise level 'sigma' as defined in Veraart 2016? Maybe worth clarifying.

    The '-noise' option of dwidenoise outputs 'sigma', i.e., the noise "standard deviation". I'll clarify this in the docs.

    If "noise map" refers to variance (sigma squared?), I'd say it should be unchanged for real data and that data converted to complex data with zero imaginary part - in line with the mathematical definition of variance (and std_rv). The magnitude of said data is equal to its norm so unless I am miss something SNR should be unchanged?

    The point is simply that if the noise is s*randn() + 1i*s*randn(), then the measured noise variance will be 2*s^2 and the noise stdev will be sqrt(2)*s. If, on the other hand, you'd only input the real channel, the noise level will be s. We were discussing that this may be confusing (in the context of denoising) to users who compare real and complex inputs. We could rescale the output for complex inputs, but that risks introducing a similar issue when the imaginary channel is zero. Ultimately, I therefore tend to agree with you that sticking to the mathematical definition is the better option; we just need to document it clearly.
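
    A quick numerical check of that factor (plain numpy, nothing to do with the dwidenoise code):

        import numpy as np

        rng = np.random.default_rng(3)
        s = 5.0
        noise = s * rng.standard_normal(1_000_000) + 1j * s * rng.standard_normal(1_000_000)

        print(np.var(noise))             # ~2 * s^2 = 50: real-valued variance of the complex noise
        print(np.sqrt(np.var(noise)))    # ~sqrt(2) * s = 7.07: the 'official' complex noise level
        print(np.std(noise.real))        # ~s = 5: the level measured from the real channel alone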

  • weixin_39870413 4 months ago

    Yep, I think I agree with here. Given that there's no way to avoid some confusion (at least for naïve people like me...), it's best to stick to the formal definition, and document this appropriately.

  • weixin_39866646 4 months ago

    Good to merge from my side unless you want to change the help text.

