weixin_39886024
2020-11-30 16:44

float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray

Code Sample:

Considering a netCDF file with the following variable:


short agc_40hz(time, meas_ind) ;
        agc_40hz:_FillValue = 32767s ;
        agc_40hz:units = "dB" ;
        agc_40hz:scale_factor = 0.01 ;

Code:

python
from netCDF4 import Dataset
import xarray as xr

d = Dataset("test.nc")
a = d.variables['agc_40hz'][:].flatten()[69] ## 21.940000000000001 'numpy.float64'
x = xr.open_dataset("test.nc")
b = x['agc_40hz'].values.flatten()[69] ## 21.939998626708984 'numpy.float32'
abs(a - b) # 0.000001373291017

Problem description:

xarray behaves differently here from a netCDF4 Dataset.

When reading the dataset with xarray, we found that the decoded type was numpy.float32 instead of numpy.float64. This netCDF variable has an int16 dtype; when the variable is read directly with the netCDF4 library, it is automatically converted to numpy.float64. In our case we lose precision when using xarray. We found two solutions for this:

First solution:

This solution disables xarray's own mask-and-scale decoding and leaves netCDF4's auto_maskandscale enabled, so netCDF4 performs the conversion (in float64).

python
d = Dataset("test.nc")
a = d.variables['agc_40hz'][:].flatten()[69] ## 21.940000000000001 'numpy.float64'
x = xr.open_dataset("test.nc", mask_and_scale=False, decode_times=False)
b = x['agc_40hz'].values.flatten()[69] ## 21.940000000000001 'numpy.float64'
abs(a - b) # 0.000000000000000

Modification in xarray/backends/netCDF4_.py line 241

python
def _disable_auto_decode_variable(var):
    """Disable automatic decoding on a netCDF4.Variable.

    We handle these types of decoding ourselves.
    """
    pass
    # var.set_auto_maskandscale(False)
    # # only added in netCDF4-python v1.2.8
    # with suppress(AttributeError):
    #     var.set_auto_chartostring(False)

Second solution:

This solution uses numpy.float64 regardless of the integer type provided.

python
d = Dataset("test.nc")
a = d.variables['agc_40hz'][:].flatten()[69] ## 21.940000000000001 'numpy.float64'
x = xr.open_dataset("test.nc")
b = x['agc_40hz'].values.flatten()[69] ## 21.940000000000001 'numpy.float64'
abs(a - b) # 0.000000000000000

Modification in xarray/core/dtypes.py line 85

python
def maybe_promote(dtype):
    """Simpler equivalent of pandas.core.common._maybe_promote

    Parameters
    ----------
    dtype : np.dtype

    Returns
    -------
    dtype : Promoted dtype that can hold missing values.
    fill_value : Valid missing value for the promoted dtype.
    """
    # N.B. these casting rules should match pandas
    if np.issubdtype(dtype, np.floating):
        fill_value = np.nan
    elif np.issubdtype(dtype, np.integer):
        #########################
        #OLD CODE BEGIN
        #########################
        # if dtype.itemsize <= 2:
        #     dtype = np.float32
        # else:
        #     dtype = np.float64
        #########################
        #OLD CODE END
        #########################

        #########################
        #NEW CODE BEGIN
        #########################
        dtype = np.float64 # whether it's int16 or int32 we use float64
        #########################
        #NEW CODE END
        #########################
        fill_value = np.nan
    elif np.issubdtype(dtype, np.complexfloating):
        fill_value = np.nan + np.nan * 1j
    elif np.issubdtype(dtype, np.datetime64):
        fill_value = np.datetime64('NaT')
    elif np.issubdtype(dtype, np.timedelta64):
        fill_value = np.timedelta64('NaT')
    else:
        dtype = object
        fill_value = np.nan
    return np.dtype(dtype), fill_value

Solution number 2 would be great for us. At this point we don't know whether this modification would introduce side effects. Is there another way to avoid this problem?
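
For reference, a minimal sketch of a manual workaround we could use in the meantime (it assumes the _FillValue and scale_factor attributes stay in .attrs when mask_and_scale=False):

python
import numpy as np
import xarray as xr

# Open without mask-and-scale decoding so the raw int16 values are kept
ds = xr.open_dataset("test.nc", mask_and_scale=False, decode_times=False)
raw = ds["agc_40hz"]

# Apply masking and scaling manually in float64
fill = raw.attrs.get("_FillValue")
scale = raw.attrs.get("scale_factor", 1.0)
offset = raw.attrs.get("add_offset", 0.0)

decoded = raw.astype(np.float64).where(raw != fill) * scale + offset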

Expected Output

In our case we expect the variable to be numpy.float64, as it is when read with netCDF4.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.3
numpy: 1.15.0rc2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: 3.6.3
IPython: 6.4.0
sphinx: None

This question originates from the open-source project: pydata/xarray

22 replies

  • weixin_39886024 · 5 months ago

    Thank you for your quick answer. In our case we may compute standard deviations or sums of squares over long lists of values, and the accumulation of these small errors due to the float32 type could create considerable differences.
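
    As a rough illustration (with made-up values, not our actual data), summing a long float32 array can drift measurably from the float64 result:

    python
    import numpy as np

    # One million hypothetical scaled measurements, all nominally 21.94
    values64 = np.full(1_000_000, 21.94, dtype=np.float64)
    values32 = values64.astype(np.float32)

    sum64 = values64.sum()
    sum32 = values32.sum(dtype=np.float32)

    # The float32 accumulation ends up a few units away from the (essentially exact) float64 sum
    print(abs(sum64 - float(sum32)))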

  • weixin_39760857 · 5 months ago

    You're right when you say

    Note that it's very easy to later convert from float32 to float64, e.g., by writing ds.astype(np.float64).

    You'll have a float64 in the end but you won't get your precision back, and it might be a problem in some cases.

    I understand the benefits of using float32 on the memory side but it is kind of a problem for us each time we have variables using scale factors.

    I'm surprised this issue (if considered as one) does not bother more people.

  • weixin_39886024 · 5 months ago

    As mentioned in the original issue, the modification is straightforward. Any idea whether this could be integrated into xarray anytime soon?

  • weixin_39940957 · 5 months ago

    As mentioned in the original issue, the modification is straightforward. Any idea whether this could be integrated into xarray anytime soon?

    Some people might prefer float32, so it is not as straightforward as it seems. It might be possible to add an option for this, but I didn't look into the details.

    You'll have a float64 in the end but you won't get your precision back

    Note that this is a false sense of precision, because in the example above the compression used is lossy, i.e. precision was lost at compression and the actual precision is now 0.01:

    
    short agc_40hz(time, meas_ind) ;
            agc_40hz:_FillValue = 32767s ;
            agc_40hz:units = "dB" ;
            agc_40hz:scale_factor = 0.01 ;
    
  • weixin_39795116 · 5 months ago

    A float32 value has 24 bits of precision in the significand, which is more than enough to store the 16 bits in the original data; the exponent (8 bits) will more or less take care of the * 0.01:

    python
    >>> import numpy as np
    >>> np.float32(2194 * 0.01)
    21.94
    

    What you're seeing is an artifact of printing out the values. I have no idea why something is printing out a float (only 7 decimal digits) out to 17 digits. Even float64 only has 16 digits (which is overkill for this application).

    The difference in subtracting the 32- and 64-bit values above are in the 8th decimal place, which is beyond the actual precision of the data; what you've just demonstrated is the difference in precision between 32-bit and 64-bit values, but it had nothing to do whatsoever with the data.

    If you're really worried about precision round-off for things like std. dev, you should probably calculate it using the raw integer values and scale afterwards. (I don't actually think this is necessary, though.)
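
    To make the printing point concrete (values matching the example above; printed outputs are approximate):

    python
    import numpy as np

    v32 = np.float32(2194) * np.float32(0.01)   # decoded in float32
    v64 = np.float64(2194) * np.float64(0.01)   # decoded in float64

    print(v32)                    # 21.94 -- float32 carries ~7 significant decimal digits
    print(float(v32))             # 21.939998626708984 -- the same float32 value written out in full
    print(abs(v64 - float(v32)))  # on the order of 1e-06, far below the 0.01 resolution of the data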

  • weixin_39610721 · 5 months ago

    A float32 value has 24 bits of precision in the significand, which is more than enough to store the 16 bits in the original data; the exponent (8 bits) will more or less take care of the * 0.01:

    Right. The actual raw data is being stored as the integer 2194 (along with the scale factor of 0.01). Both 21.939998626708984 (as float32) and 21.940000000000001 (as float64) are floating point approximations of the exact decimal number 21.94.

    I would be happy to add options for whether to default to float32 or float64 precision. There are clearly tradeoffs here:

    - float32 uses half the memory
    - float64 has more precision for downstream computation

    I don't think we can make a statement about which is better in general. The best we can do is make an educated guess about which will be more useful / less surprising for most and/or new users, and pick that as the default.

  • weixin_39795116 · 5 months ago

    But since it's a downstream calculation issue, and does not impact the actual precision of what's being read from the file, what's wrong with saying "Use data.astype(np.float64)"? It's completely identical to doing it internally in xarray.

  • weixin_39610721 · 5 months ago

    But since it's a downstream calculation issue, and does not impact the actual precision of what's being read from the file, what's wrong with saying "Use data.astype(np.float64)"? It's completely identical to doing it internally in xarray.

    It's almost but not quite identical. The difference is that the data gets scaled twice. This adds twice the overhead for scaling the values (which to be fair is usually negligible compared to IO).

    Also, to get exactly equivalent numerics for further computation you would need to round again, e.g., data.astype(np.float64).round(np.ceil(-np.log10(data.encoding['scale_factor']))). This starts to get a little messy :).

  • weixin_39795116 · 5 months ago

    I'm not following why the data are scaled twice.

    Your point about the rounding being different is well-taken, though.

  • weixin_39610721 · 5 months ago

    I'm not following why the data are scaled twice.

    We automatically scale the data from int16->float32 upon reading it in xarray (if decode_cf=True). There's no way to turn that off and still get automatic scaling, so the best you can do is layer on int16->float32->float64, when you might prefer to only do int16->float64.
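
    A tiny sketch of the two conversion paths, using the value from the original report:

    python
    import numpy as np

    raw = np.int16(2194)

    # int16 -> float32 (what the CF decoding currently produces), promoted later
    via_float32 = np.float64(np.float32(raw) * np.float32(0.01))

    # int16 -> float64 directly (what netCDF4-python produces)
    direct_float64 = np.float64(raw) * 0.01

    print(abs(via_float32 - direct_float64))  # ~1e-06, the difference discussed in this issue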

  • weixin_39795116 · 5 months ago

    Ah, ok, not scaling per se (i.e. * 0.01), but a second round of value conversion.

  • weixin_39610721 · 5 months ago

    Both multiplying by 0.01 and float32 -> float64 are approximately equivalently expensive. The cost is dominated by the memory copy.

  • weixin_39760857 · 5 months ago

    To explain the full context and why it became some kind of a problem to us :

    We're experimenting with the parquet format (via pyarrow) and we first did something like : netcdf file -> netcdf4 -> pandas -> pyarrow -> pandas (when read later on).

    We're now looking at xarray and the huge ease of access it offers to netcdf like data and we tried something similar : netcdf file -> xarray -> pandas -> pyarrow -> pandas (when read later on).

    Our problem appears when we're reading and comparing the data stored with these two approaches. The difference between the two was sometimes larger than what was expected/acceptable (10e-6 for float32, if I'm not mistaken). We're not constraining any type, and we let the system and modules decide how to encode what, and in the end we have significantly different values.

    There might be something wrong in our process, but it originates here with this float32/float64 choice, so we thought it might be a problem.

    Thanks for taking the time to look into this.

  • weixin_39610721 · 5 months ago

    Please let us know if converting to float64 explicitly and rounding again does not solve this issue for you.

  • weixin_39760857 · 5 months ago

    So, here is a more complete example showing this problem. NetCDF file used in the example: test.nc.zip

    python
    from netCDF4 import Dataset
    import xarray as xr
    import numpy as np
    import pandas as pd
    
    d = Dataset("test.nc")
    v = d.variables['var']
    
    print(v)
    #<class 'netCDF4._netCDF4.Variable'>
    #int16 var(idx)
    #    _FillValue: 32767
    #    scale_factor: 0.01
    #unlimited dimensions: 
    #current shape = (2,)
    #filling on
    
    df_nc = pd.DataFrame(data={'var': v[:]})
    
    print(df_nc)
    #     var
    #0  21.94
    #1  27.04
    
    
    ds = xr.open_dataset("test.nc")
    df_xr = ds['var'].to_dataframe()
    
    # Comparing both dataframes with float32 precision (1e-6)
    mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
    
    print(mask)
    #[False  True]
    
    print(df_xr)
    #           var
    #idx           
    #0    21.939999
    #1    27.039999
    
    # Changing the type and rounding the xarray dataframe
    df_xr2 = df_xr.astype(np.float64).round(int(np.ceil(-np.log10(ds['var'].encoding['scale_factor']))))
    mask = np.isclose(df_nc['var'], df_xr2['var'], rtol=0, atol=1e-6)
    
    print(mask)
    #[ True  True]
    
    print(df_xr2)
    #       var
    #idx       
    #0    21.94
    #1    27.04
    

    As you can see, the problem appears early in the process (not related to the way data are stored in parquet later on) and yes, rounding values does solve it.

  • weixin_39886024 · 5 months ago

    Any updates about this ?

  • weixin_39940957 · 5 months ago

    I think we are still talking about different things. In the example above there is still a problem at this line:

    python
    # Comparing both dataframes with float32 precision (1e-6)
    mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
    

    As discussed several times above, this test is misleading: it should assert for atol=0.01, which is the real accuracy of the underlying data. For this purpose float32 is more than good enough.
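
    For example, the same check passes once the tolerance matches the data's resolution (reusing df_nc and df_xr from the example above):

    python
    # Same comparison as above, but with the 0.01 accuracy of the packed data
    mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=0.01)
    print(mask)  # [ True  True]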

    As was said earlier:

    I would be happy to add options for whether to default to float32 or float64 precision.

    so we would welcome a PR in this direction! I don't think we need to change the default behavior though, as there is a slight possibility that some people are relying on the data being float32.

  • weixin_39886024 · 5 months ago

    Hi, thank you for your efforts in making xarray a great library. As mentioned in the issue, the discussion moved towards a PR to make xr.open_dataset parameterizable. This post is to ask for your recommendations regarding our PR.

    In this case we would add a parameter to the open_dataset function called "force_promote", a boolean that is False by default and thus not mandatory, and then pass that parameter down to the function maybe_promote in dtypes.py, where we would do the following:

    python
    if dtype.itemsize <= 2 and not force_promote:
        dtype = np.float32
    else:
        dtype = np.float64

    The downside of that is that we somewhat pollute the code with a parameter that is only used in a specific case.

    The second approach would check the value of an environment variable called "XARRAY_FORCE_PROMOTE"; if it exists and is set to true, it would force promoting the type to float64.
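
    A minimal sketch of that second approach (the helper name and variable handling are hypothetical, not existing xarray code):

    python
    import os
    import numpy as np

    def promote_integer_dtype(dtype):
        """Hypothetical helper: choose the float dtype used when promoting integers."""
        force = os.environ.get("XARRAY_FORCE_PROMOTE", "").lower() in ("1", "true", "yes")
        if dtype.itemsize <= 2 and not force:
            return np.dtype(np.float32)
        return np.dtype(np.float64)

    print(promote_integer_dtype(np.dtype("int16")))  # float32 unless XARRAY_FORCE_PROMOTE is set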

    Please tell us which approach best suits your vision of xarray.

    Regards.

  • weixin_39561673 · 5 months ago

    Hi everyone, I've started using xarray recently, so I apologize if I'm saying something wrong... I've also faced the issue reported here, so I have tried to find some answers.

    Unpacking netCDF files with respect to the NUG attributes (scale_factor and add_offset) is covered by the CF conventions, which are clear about which data type should be applied to the unpacked data (cf-conventions-1.7/packed-data). In that chapter you can read: "If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data. However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes."

    In my opinion this should be the default behavior of the xarray.decode_cf function, which doesn't invalidate the idea of forcing the unpacked data dtype. However, neither the CFScaleOffsetCoder nor the CFMaskCoder de/encoder classes seem to follow these CF directives, since the first doesn't look at the scale_factor or add_offset dtypes, and the second also changes the unpacked data dtype (maybe because nan values are being used to replace the fill values).

    Sorry for such an extensive comment without any solution proposals... Regards! :+1:
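
    A small standalone sketch of the dtype rule quoted above (for illustration only, not the actual coder code):

    python
    import numpy as np

    def unpacked_dtype(packed_dtype, scale_factor=None, add_offset=None):
        """CF rule: unpacked data takes the dtype of scale_factor/add_offset
        when those attributes differ in type from the packed variable."""
        attr_dtypes = [np.asarray(a).dtype
                       for a in (scale_factor, add_offset) if a is not None]
        if attr_dtypes:
            return np.result_type(*attr_dtypes)
        return np.dtype(packed_dtype)

    print(unpacked_dtype(np.int16, scale_factor=np.float64(0.01)))  # float64
    print(unpacked_dtype(np.int16, scale_factor=np.float32(0.01)))  # float32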

  • weixin_39610721 · 5 months ago

    thanks for pointing this out -- I think we simply missed this part of the CF conventions document!

    Looking at the dtype for add_offset and scale_factor does seem like a much cleaner way to handle this issue. I think we should give that a try!

    We will still need some fall-back choice for CFMaskCoder if neither an add_offset nor a scale_factor attribute is provided (due to xarray's representation of missing values as NaN), but this is a relatively uncommon situation.

  • weixin_39781209 · 5 months ago

    Hey everyone, I stumbled on this while searching for approximately the same problem. Thought I'd share since the issue is still open. On my part, there are two situations that seem buggy. I haven't been using xarray for that long yet, so maybe there is something I'm missing here...

    My first problem relates to the data types of dimensions with float notation. To give another answer to the question asked above:

    To clarify: why is it a problem for you

    it is a problem in my case because I would like to perform slicing operations on a dataset using longitude values from another dataset. This operation raises a "KeyError: not all values found in index 'longitude'", since either one dataset's longitude is float32 and the other's is float64, or the two datasets' float32 approximations are not exactly the same value. I can work around this by assigning new float64 coords after reading, and it works, though it is kind of a hassle considering I have to do this thousands of times. This situation also creates a problem when concatenating multiple netCDF files together (along the time dim in my case): the discrepancies between the float32 approximations, or the float32 vs float64 mismatch, will add new dimension values where they shouldn't.
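
    For instance (a made-up illustration of the mismatch, not my actual data):

    python
    import numpy as np
    import xarray as xr

    # The same nominal longitude grid stored as float64 and as float32
    lon64 = np.arange(0, 10, 0.1, dtype=np.float64)
    ds_a = xr.Dataset(coords={"longitude": lon64})
    ds_b = xr.Dataset(coords={"longitude": lon64.astype(np.float32)})

    # Exact label selection across the two can fail with KeyError because of
    # float32 rounding; nearest-neighbour selection with a tolerance avoids it
    subset = ds_b.sel(longitude=ds_a["longitude"].values, method="nearest", tolerance=1e-4)

    # Or overwrite the coordinate once with the reference float64 grid
    ds_b = ds_b.assign_coords(longitude=lon64)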

    The second part of my problem comes with writing/reading netCDF files (maybe more related to another issue). I tried to change the data type to float64 for all my files, save them, and then perform what I need to do, but for some reason, even though the dtype is float64 for all my dimensions when writing the files (using default args), it will sometimes be float32 and sometimes float64 when reading back the files (with default arg values) that were saved with a float64 dtype. If I'm using the default args, shouldn't the decoding make the dtype of the dimension the same for all the files I read?

  • weixin_39610721 · 5 months ago

    To clarify: why is it a problem for you to get floating point values like 21.939998626708984 instead of 21.940000000000001? Is it a loss of precision in some downstream calculation? Both numbers are accurate well within the precision indicated by the netCDF file (0.01).

    Note that it's very easy to later convert from float32 to float64, e.g., by writing ds.astype(np.float64).

