This sounds generically useful enough to support. I'd like to hear comments from others, but I'm fairly positive on it.
I wanted to open this issue to explore the possibility of incorporating blosc compression into ASDF. I know that integrating every user's favorite compression package is an unsustainable approach, but I do think that it's worth considering blosc, and that it could even be the recommended compression for binary array data in ASDF.
blosc is not a compression algorithm, per se, but a "meta compressor" that does a byte transpose in blocks then feeds the result to a familiar compressor, like zlib or lz4. For binary arrays of M-byte elements, it transposes in blocks of MxB bytes (the block size B is less important, usually at least a few thousand). The results can be dramatic—a many-fold increase in the compression factor. The byte transpose step is almost always fast compared to the compression. Since a primary use-case for ASDF is storing arrays of binary data with a known element size, blosc seems particularly well-suited to ASDF.
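To make the transpose concrete, here is a toy numpy sketch of the shuffle step (illustrative only; blosc's real implementation works in cache-sized blocks and is heavily optimized):

```python
import numpy as np

def byte_shuffle(arr):
    """Toy version of blosc's shuffle filter: view the array as raw
    bytes, reshape to (n_elements, itemsize), and transpose so that
    the k-th byte of every element becomes contiguous."""
    itemsize = arr.dtype.itemsize
    return arr.view(np.uint8).reshape(-1, itemsize).T.tobytes()

# Small-valued 8-byte little-endian integers: after shuffling, the six
# high-byte "planes" are long runs of zeros, which any conventional
# compressor (zlib, lz4, zstd) handles extremely well.
data = np.arange(1000, dtype="<i8")
shuffled = byte_shuffle(data)
```

Running a conventional compressor over `shuffled` instead of the raw buffer is what produces the many-fold gains described above.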
I needed blosc compression for my project (storing several PB of data in ASDF files), so I went ahead and implemented it in my fork here: https://github.com/lgarrison/asdf/. No other compressor could find a substantial compression opportunity in my data except blosc.
I followed the example of the lz4 compression for my implementation. Many of the ASDF integration elements could use your expert advice; in particular, how to automatically propagate the data type size down to the compressor, and how to propagate compression options that advanced users might want to tweak. But hopefully the binary data model is robust, since it uses the same block model as lz4.
blosc is already available on PyPI/conda-forge and is pre-packaged with a number of backing compressors, like zstd and lz4. I find zstd to be an excellent tradeoff between compression and performance. It finds compression opportunities that lz4 does not while still providing >1.5 GB/s decompression per core, at least in my data sets.
While I'd love to see blosc included in ASDF and would be happy to put the work in to make that happen, if it's not considered appropriate for inclusion, then perhaps this would be a good time to talk about issue #273 on how to integrate user-provided compression.
And in any case, I made some performance enhancements to the block-decompression procedure that could be ported to the lz4 decompressor (fewer memory copies).
Thanks for your consideration. And thanks for developing ASDF, I've found it really useful!
- weixin_39976382 2020-12-09 15:32
I would recommend looking into whether we can integrate the compression library built by zarr: https://github.com/zarr-developers/numcodecs to get blosc and a bunch of other things all at once?
I agree that's worth looking into. It would be nice if all ASDF compression passed through this library instead of each individual compression library. Even nicer would be if Numcodecs could decompress binary blobs compressed by the standalone compression libraries, so that the standalone library interfaces could be removed from compression.py while still remaining backwards compatible with existing ASDF files. That probably depends on what low-level interfaces Numcodecs exposes.
We probably need to consider the stability of the binary data model of Numcodecs, too. Skimming through the documentation, I don't see any comments about the binary format being frozen. Hopefully the "binary format" is just a thin header of some kind that indicates the compression library and maybe a blocking factor, but we would want to be confident all future versions of Numcodecs would be backwards compatible. Its Zarr integration makes me think that would be the case.
Digging into the Numcodecs source a bit more, I think perhaps the reason I couldn't find anything about the binary header format is that there isn't one! The user is expected to know the compression algorithm used, I think. That actually works well for ASDF, since it already stores the compression algorithm as a 4-byte string.
Since Numcodecs appears to be binary compatible with the libraries in their standalone form, I think this means that zlib and bzip2 compression and decompression could be rerouted through Numcodecs with full ASDF backwards compatibility. With the lz4 compression, the block structure imposed by ASDF means that we'd have to keep the blocking/unblocking code, but I think the lz4 call could be rerouted to Numcodecs as well. blosc can follow the same pattern. As far as I can tell, Numcodecs doesn't have its own blocking implementation.
I think the biggest hurdle is that ASDF's compression ID string is only 4 bytes. The Numcodecs codec_id doesn't appear to have a max length, so we'd need some kind of mapping from string to 4-byte ID. Or we'd need to widen the 4-byte field and bump the header version; that sounds less desirable. Is there any world in which the compression string goes in the YAML?
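If we keep the 4-byte field, the mapping could be as simple as a lookup table on the ASDF side. Everything below is hypothetical naming and padding, just to show the shape of it; neither the table nor the function is part of either project's API:

```python
# Hypothetical table from ASDF's 4-byte block-header field to a
# numcodecs codec_id string.
ASDF_TO_CODEC_ID = {
    b"zlib": "zlib",
    b"bzp2": "bz2",  # the one quirky identifier
    b"lz4\x00": "lz4",  # illustrative padding for the short name
    b"blsc": "blosc",
}

def codec_id_for(field):
    """Translate a 4-byte compression field into a numcodecs codec_id."""
    try:
        return ASDF_TO_CODEC_ID[bytes(field)]
    except KeyError:
        raise ValueError("unknown compression field: %r" % field)
```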
Any Numcodecs experts ( ?) want to weigh in on whether what I've surmised about Numcodecs is correct?
I'll mention that my own timeline for getting blosc compression working is pretty accelerated, as I'm working under deadline pressure and need to start compressing my data this week. But my sense is that what I've implemented in my fork (plain blosc compression inside the ASDF blocking) will be binary compatible with an eventual Numcodecs solution. But if anyone sees a reason why that might not be the case, please let me know!
- weixin_39976382 2020-12-09 15:32
I have previously spent about 1 hour looking at numcodecs and came to roughly the same understanding you described, definitely not an expert :laughing:
I'm going to take a look at it today. One concern is how sensitive the blocking optimization is to the machine it runs on. If it is sensitive to machine variations, that is a problem as far as interchange goes, but I would guess that as long as the chosen block size is smaller than any current processor's L2 cache, it probably isn't too big a performance hit on processors with larger caches. One would guess that the bigger falloff comes from choosing a block size that is too large. But I have little expertise in this at the moment.
I agree that's a concern. I explored this a bit while tuning the ASDF blocking and blosc blocking factors for my dataset, and the answers seemed to vary according to the dataset being compressed. Some showed a 50% performance dropoff when spilling L2, others didn't show any hit. Sometimes it even got faster when spilling L2! I suspect cache associativity is also a performance factor here for the transposes (i.e. blosc shuffles).
Overall, my sense from the tuning was that staying within L3 gave good performance while finding maximum compression opportunity. I settled on sizes modestly bigger than the current 4 MB ASDF block size: 12 and 21 MB for two of my datasets, with 4 blosc blocks inside each (thus allowing up to 4-way blosc parallelism). I'll be interested to hear what you find.
I think an L2 or L3 heuristic might serve for most users, but we should probably add a mechanism by which advanced users can pass tunings for the ASDF block size and underlying compression.
For the most part, the binary use in numcodecs appears to be purely language agnostic, though I agree that all compression info should be in the YAML (from what I see, numcodecs puts it in JSON, which should map trivially). I do want to be careful that there aren't any language dependencies built into the standard.
- weixin_39950812 2020-12-09 15:32
Here's a list of all the codec_id values I could find in the numcodecs source:
adler32 astype blosc bz2 categorize crc32 delta fixedscaleoffset gzip json json2 lz4 lzma msgpack msgpack2 packbits pickle quantize vlen-array vlen-bytes vlen-utf8 zlib zstd
The current ASDF compression values:
zlib bzp2 lz4
It's a shame; if it weren't for that quirky bzp2, our identifiers would match.
I notice that you worry about a 4MB block size. ASDF blocks are not limited to that. Could you explain where this comes from? Also, any reason that all the blosc blocks could not go into one ASDF block?
The 4 MB is the default compression block size that's currently in compression.py. I was calling it the "ASDF block size" to distinguish it from the blosc block size, even though it has nothing to do with ASDF proper, but rather with ASDF's block compression scheme. Sorry for the confusion!
blosc can only operate on 2 GB at once, so it needs to be fed data in blocks that are smaller than that. These blocks could probably be much larger than 4 MB since it's going to apply its own blocking internally, but it didn't seem to matter much in my tests. And if we want to trigger effective overlap of compute and IO, either implicitly through readahead or explicitly through some future ASDF feature, then it seems beneficial to keep the outer block size not too big, in the 1s to 100s of MB range.点赞 评论 复制链接分享
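The blocking scheme itself is simple enough to sketch. Here zlib stands in for blosc, and the structure (fixed-size input blocks, each compressed independently) is what caps any single compressor call below the 2 GB limit and makes streaming decompression possible:

```python
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # the current default in compression.py

def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress fixed-size input blocks independently, so each can be
    decompressed (and prefetched) on its own."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blocks(blocks):
    # Joining once at the end keeps the number of memory copies down,
    # in the spirit of the lz4 decompressor improvements mentioned above.
    return b"".join(zlib.decompress(b) for b in blocks)

data = bytes(8 * 1024 * 1024)  # 8 MiB of zeros -> exactly two blocks
blocks = compress_blocks(data)
```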