weixin_39638647
2020-11-30 16:47

Problem writing netcdf from xarray directly to S3

I'm trying to write a netcdf file directly from xarray to S3 object storage. I'm wondering:

  1. Why writing NetCDF files requires a "seek"
  2. Why the scipy engine is getting used instead of the specified netcdf4 engine.
  3. If there are nice workarounds (besides writing the NetCDF file locally, then using the AWS CLI to transfer to S3)

Code sample:

```python
import fsspec
import xarray as xr

ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                     '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')

outfile = fsspec.open('s3://chs-pangeo-data-bucket/rsignell/test.nc',
                      mode='wb', profile='default')

with outfile as f:
    ds.to_netcdf(f, engine='netcdf4')
```

which produces:

```python-traceback
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-024939f31fe4> in <module>
      2                       mode='wb', profile='default')
      3 with outfile as f:
----> 4     ds.to_netcdf(f, engine='netcdf4')

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1552             unlimited_dims=unlimited_dims,
   1553             compute=compute,
-> 1554             invalid_netcdf=invalid_netcdf,
   1555         )
   1556 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1102     finally:
   1103         if not multifile and compute:
-> 1104             store.close()
   1105 
   1106     if not compute:

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/backends/scipy_.py in close(self)
    221 
    222     def close(self):
--> 223         self._manager.close()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/backends/file_manager.py in close(***failed resolving arguments***)
    331     def close(self, needs_lock=True):
    332         del needs_lock  # ignored
--> 333         self._value.close()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/scipy/io/netcdf.py in close(self)
    297         if hasattr(self, 'fp') and not self.fp.closed:
    298             try:
--> 299                 self.flush()
    300             finally:
    301                 self.variables = OrderedDict()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/scipy/io/netcdf.py in flush(self)
    407         """
    408         if hasattr(self, 'mode') and self.mode in 'wa':
--> 409             self._write()
    410     sync = flush
    411 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/scipy/io/netcdf.py in _write(self)
    411 
    412     def _write(self):
--> 413         self.fp.seek(0)
    414         self.fp.write(b'CDF')
    415         self.fp.write(array(self.version_byte, '>b').tostring())

/srv/conda/envs/pangeo/lib/python3.7/site-packages/fsspec/spec.py in seek(self, loc, whence)
   1122         loc = int(loc)
   1123         if not self.mode == "rb":
-> 1124             raise OSError("Seek only available in read mode")
   1125         if whence == 0:
   1126             nloc = loc

OSError: Seek only available in read mode
```
Output of xr.show_versions():

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.138-114.102.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: installed
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: 2.4.0
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None
```

This question comes from the open-source project: pydata/xarray


6 replies

  • weixin_39769183 · 5 months ago

    Not sure, but I think the h5netcdf engine is the only one that allows for file-like objects (so anything going through fsspec).
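
    For instance, a minimal sketch (not from the thread) of the read direction, where file-like objects are known to work with h5netcdf; the bucket path here is hypothetical:

    ```python
    import fsspec
    import xarray as xr

    # h5py >= 2.9 (and hence h5netcdf) accepts Python file-like objects,
    # so an fsspec file opened for reading can be passed straight in.
    with fsspec.open('s3://some-bucket/some.nc', mode='rb') as f:  # hypothetical path
        ds = xr.open_dataset(f, engine='h5netcdf')
        print(ds)
    ```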

  • weixin_39638647 · 5 months ago

    Okay, I tried setting engine='h5netcdf', but still got:

    
    ```
    OSError: Seek only available in read mode
    ```
    

    Thinking about this a little more, it's pretty clear why writing NetCDF to S3 would require seeking.

    I asked about supporting seek for writing in fsspec and was told that would be pretty hard. And in fact, the performance would probably be pretty terrible, as lots of little writes would be required.

    So maybe it's best just to write netcdf files locally and then push them to S3.

    And to facilitate that, a PR was merged yesterday to enable simplecache for writing in fsspec, so after doing:

    
    ```
    pip install git+https://github.com/intake/filesystem_spec.git
    ```
    

    in my environment, this now works:

    ```python
    import xarray as xr
    import fsspec

    ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                         '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')

    outfile = fsspec.open('simplecache::s3://chs-pangeo-data-bucket/rsignell/foo2.nc',
                          mode='wb', s3=dict(profile='default'))
    with outfile as f:
        ds.to_netcdf(f)
    ```
    

    (Here I'm telling fsspec to use the AWS credentials in my "default" profile.)

    Thanks Martin!!!

  • weixin_39941620 · 5 months ago

    I think we should add some documentation on this stuff.

    We have "cloud storage buckets" under zarr( https://xarray.pydata.org/en/stable/io.html#cloud-storage-buckets) so maybe a similar section under netCDF?

  • weixin_39553757 · 5 months ago

    The write feature for simplecache isn't released yet, of course.

    It would be interesting if someone could subclass file and write locally with h5netcdf to see what kind of seeks it does. Is it popping back to some file header to update array sizes? Presumably it would need a fixed-size header to do that. Parquet and other cloud formats put the metadata in the footer exactly for this reason, so that you only write once you know everything and only ever move forward in the file.
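
    Not from the thread, but a minimal sketch of such a probe (the file name and toy variable are made up for illustration):

    ```python
    import io

    import h5netcdf
    import numpy as np

    class SeekLogger(io.FileIO):
        """A local file that logs every seek, to show how h5py moves around while writing."""
        def seek(self, offset, whence=0):
            print(f'seek(offset={offset}, whence={whence})')
            return super().seek(offset, whence)

    # Write a tiny NetCDF4 file through the logging file object
    with SeekLogger('probe.nc', 'w+') as f:
        with h5netcdf.File(f, 'w') as nc:
            nc.dimensions['t'] = 10
            v = nc.create_variable('x', ('t',), dtype='f8')
            v[:] = np.arange(10.0)
    ```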

  • weixin_39638647 · 5 months ago

    I asked offline and was reminded that:

    "File metadata are dispersed throughout an HDF5 [and NetCDF4] file in order to support writing and modifying array sizes at any time of execution."

    Looking forward to simplecache:: for writing in fsspec 0.7.5!

  • weixin_39619270 · 5 months ago

    I’ve run into this as well. It’s not pretty, but my usual workaround is to write to a local temporary file and then upload with fsspec, along the lines of the sketch below. I can never remember exactly which netCDF engine to use...
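
    A minimal sketch of that workaround (not from the thread; the bucket path and dataset are hypothetical, and any engine works for a local path):

    ```python
    import os
    import tempfile

    import fsspec
    import xarray as xr

    ds = xr.Dataset({'x': ('t', list(range(10)))})  # stand-in dataset

    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, 'out.nc')
        ds.to_netcdf(local_path, engine='netcdf4')   # a real, seekable local file
        fs = fsspec.filesystem('s3')                 # needs s3fs installed
        fs.put(local_path, 's3://my-bucket/out.nc')  # hypothetical bucket
    ```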

