weixin_39692172
2020-11-23 02:03

Parallel writing

This may be out of scope for this project, but I recently read an interesting article detailing how R achieves extremely fast CSV writing. Essentially, serialization, not writing, is the bottleneck, so R spawns multiple workers to do the serialization and sends the bytes back to a writer thread. Would something similar be possible in Rust (with rayon)? I don't see why not, though I'm sure the speed gains would not be as spectacular.
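To illustrate the pattern, here is a rough sketch (not part of any crate; the Row type and output path are invented for the example) that serializes chunks of records into in-memory buffers in parallel with rayon, then writes the finished buffers from a single thread:

use std::fs::File;
use std::io::Write;

use rayon::prelude::*;
use serde::Serialize;

#[derive(Serialize)]
struct Row {
    id: u64,
    name: String,
    value: f64,
}

fn write_parallel(rows: &[Row]) -> Result<(), Box<dyn std::error::Error>> {
    // Serialize each chunk of rows into its own in-memory CSV buffer, in parallel.
    let chunks: Vec<Vec<u8>> = rows
        .par_chunks(64 * 1024)
        .map(|chunk| {
            let mut buf = Vec::new();
            {
                let mut wtr = csv::WriterBuilder::new()
                    .has_headers(false)
                    .from_writer(&mut buf);
                for row in chunk {
                    wtr.serialize(row).expect("serialization failed");
                }
                wtr.flush().expect("flush failed");
            }
            buf
        })
        .collect(); // rayon preserves chunk order here

    // Write the header and all serialized chunks from a single thread.
    let mut file = File::create("out.csv")?;
    file.write_all(b"id,name,value\n")?;
    for chunk in &chunks {
        file.write_all(chunk)?;
    }
    Ok(())
}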

This question comes from the open source project: BurntSushi/rust-csv

9 replies

  • weixin_39997311 2020-11-23 02:03

    Is R's CSV writing faster than this crate? That's the first question you need to answer.

    The fastest way to write CSV records is via the Writer::write_byte_record API, and that should not suffer from obvious performance bugs (like calling out to formatting routines).

    The write_record and serialize APIs will be slower, since they handle more flexible input. The serialize routines will do something like fprintf to, e.g., format numbers as strings.
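    For concreteness, the difference between the two paths looks roughly like this (a made-up example, not a benchmark; headers are disabled so both records write as plain data rows):

    use csv::{ByteRecord, WriterBuilder};
    use serde::Serialize;

    #[derive(Serialize)]
    struct Row<'a> {
        city: &'a str,
        population: u64,
        latitude: f64,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut wtr = WriterBuilder::new()
            .has_headers(false)
            .from_writer(std::io::stdout());

        // Fastest path: the fields are already bytes, so no formatting is needed.
        wtr.write_byte_record(&ByteRecord::from(vec!["Boston", "667137", "42.3601"]))?;

        // Slower path: serde has to format the integer and float fields as strings first.
        wtr.serialize(Row { city: "Boston", population: 667137, latitude: 42.3601 })?;

        wtr.flush()?;
        Ok(())
    }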

    In general, I think this issue is too broad. I think someone (not me) needs to put in the hard work to gather data in the form of benchmarks. Even then, if parallelism would help, I don't see why it couldn't be built on top of this library. It certainly seems like an unnecessary complication to add to csv proper?

    I'm not sure I would use rayon for this.

  • weixin_39692172 2020-11-23 02:03

    Fair points. I will try to put together some benchmarks.

  • weixin_39997311 2020-11-23 02:03

    Thanks! I'd be very curious to hear the results, regardless of what they are!

  • weixin_39692172 2020-11-23 02:03

    I created a simple gist (with timings) here: https://gist.github.com/jesskfullwood/2349d8306c708d879d5689fe611daeea

    Overall the Rust code did not perform very well against R, presumably because Rust's serialization overhead is greater. But I wrote both in the simplest way I could think of, so I am sure there are many optimizations that could be made. I will try to dig a little deeper.

    One point is that R's data.table structure is column-oriented, whereas the Rust data is row-oriented; I would actually expect this to work in Rust's favor.

    PS: I'm a beginner at R, so apologies for the ugly code.

  • weixin_39997311 2020-11-23 02:03

    I'm not familiar with R. Could you please include the command you used to run your program? And what version of R?

  • weixin_39692172 2020-11-23 02:03

    I have updated the gist (including the timings).

  • weixin_39692172 2020-11-23 02:03

    Incidentally, I agree that this issue is probably out of scope for this project.

  • weixin_39597399 2020-11-23 02:03

    I've encountered something similar, where serialization is the heaviest part of creating a CSV file. Using the serialize API is extremely nice and convenient. I think the API could be split up slightly to make it easier for external projects to implement their own parallel writing on top of csv.

    In particular, if there were a ByteRecord::serialize<T: Serialize>(_: &T) -> Result<ByteRecord> function, one could divide that work across threads and send the results back to a single thread to run write_byte_record serially. There's one other missing piece, which is headers: these could possibly be handled via a Writer::serialize_headers<T: Serialize>(&mut self, _: &T) function that writes just the headers and nothing else.
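    As a very rough sketch of how those proposed functions might be used (neither exists in csv today; the signatures are only the ones suggested above):

    // Hypothetical usage of the proposed API; ByteRecord::serialize and
    // Writer::serialize_headers do not exist in the csv crate yet.
    use csv::{ByteRecord, Writer};
    use rayon::prelude::*;
    use serde::Serialize;

    fn write_parallel<T, W>(wtr: &mut Writer<W>, rows: &[T]) -> csv::Result<()>
    where
        T: Serialize + Sync,
        W: std::io::Write,
    {
        if let Some(first) = rows.first() {
            // Proposed: write only the header row derived from T.
            wtr.serialize_headers(first)?;
        }

        // Proposed: serialize each row to a ByteRecord on the rayon thread pool.
        let records: Vec<csv::Result<ByteRecord>> =
            rows.par_iter().map(|row| ByteRecord::serialize(row)).collect();

        // Feed the finished records back to a single writer, in order.
        for record in records {
            wtr.write_byte_record(&record?)?;
        }
        Ok(())
    }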

  • weixin_39997311 2020-11-23 02:03

    I've put out what I believe is a fix for the specific performance problem reported in this issue in csv 1.0.6. In particular, here is the benchmark output on my machine. For R:

    
    > source('writetest.R')
    [1] "generate:"
    Time difference of 55.17594 secs
    [1] "fwrite:"
    omp_get_max_threads() = 16
    omp_get_thread_limit() = 2147483647
    DTthreads = 0
    RestoreAfterFork = true
    No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
    Column writers: 5 11 3 3 11 5
    maxLineLen=77 from sample. Found in 0.000s
    Writing column names ... done in 0.000s
    Writing 10000000 rows in 184 batches of 54471 rows (each buffer size 8MB, showProgress=1, nth=16) ... done (actual nth=16, anyBufferGrown=no, maxBuffUsed=47%)
    Time difference of 2.081786 secs
    

    For Rust:

    
    $ ./target/release/writetest
    generated data in 1.780228732s
    Wrote 10000000 rows (682MB) in 2.370s (287.985MB/s)
    

    So it's still a tad slower, but within spitting distance. Before my fix, the Rust program was clocking in at about 5.2 seconds, so that's a 2x speedup. And here are the relevant micro-benchmarks:

    
    name                              before ns/iter         after ns/iter          diff ns/iter   diff %  speedup
    count_game_serialize_owned_bytes  14,252,595 (154 MB/s)  10,999,427 (200 MB/s)    -3,253,168  -22.83%   x 1.30
    count_game_serialize_owned_str    14,446,485 (152 MB/s)  11,279,116 (195 MB/s)    -3,167,369  -21.92%   x 1.28
    count_mbta_serialize_owned_bytes  2,821,205 (221 MB/s)   1,812,599 (343 MB/s)     -1,008,606  -35.75%   x 1.56
    count_mbta_serialize_owned_str    2,789,786 (223 MB/s)   1,799,033 (346 MB/s)       -990,753  -35.51%   x 1.55
    count_nfl_serialize_owned_bytes   5,275,167 (258 MB/s)   3,217,015 (424 MB/s)     -2,058,152  -39.02%   x 1.64
    count_nfl_serialize_owned_str     5,355,481 (254 MB/s)   3,039,308 (449 MB/s)     -2,316,173  -43.25%   x 1.76
    count_pop_serialize_owned_bytes   7,949,709 (120 MB/s)   4,958,633 (192 MB/s)     -2,991,076  -37.62%   x 1.60
    count_pop_serialize_owned_str     8,028,530 (118 MB/s)   4,924,101 (194 MB/s)     -3,104,429  -38.67%   x 1.63
    

    The specific speedup was achieved by switching to ryu and itoa for serializing floating point numbers and integers, respectively, which are the same libraries used by serde_json. They are quite fast!
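    For reference, both crates expose a small buffer-based formatting API; a minimal usage example (not from the benchmark code):

    fn main() {
        // Each Buffer is a small stack-allocated scratch space; format()
        // returns a &str borrowed from it, avoiding std's formatting machinery.
        let mut float_buf = ryu::Buffer::new();
        let mut int_buf = itoa::Buffer::new();

        let lat: &str = float_buf.format(42.3601_f64);
        let pop: &str = int_buf.format(667_137_u64);

        println!("{},{}", pop, lat);
    }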

    Your API additions seem plausible. It might be worth opening a new ticket to track them.

