drsqpko5286
2014-12-16 12:38
Views: 331
Accepted

What is the most efficient way to read zlib-compressed files in Golang?

I'm reading and simultaneously parsing (decoding) a file in a custom format that is compressed with zlib. My question is how I can efficiently decompress and then parse the uncompressed content without growing the slice. I would like to parse it whilst reading it into a reusable buffer.

This is for a speed-sensitive application and so I'd like to read it in as efficiently as possible. Normally I would just ioutil.ReadAll and then loop again through the data to parse it. This time I'd like to parse it as it's read, without having to grow the buffer into which it is read, for maximum efficiency.
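
That baseline looks roughly like this (a sketch for context, assuming imports of os, compress/zlib and io/ioutil; parseFile and parse are placeholder names):

    func parseFile(filename string) error {
        fi, err := os.Open(filename)
        if err != nil {
            return err
        }
        defer fi.Close()
        zr, err := zlib.NewReader(fi)
        if err != nil {
            return err
        }
        defer zr.Close()
        data, err := ioutil.ReadAll(zr) // grows a slice to hold the whole payload
        if err != nil {
            return err
        }
        parse(data) // second pass over all of the decompressed data
        return nil
    }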

Basically I'm thinking that if I can find a buffer of the perfect size, then I can read into it, parse it, overwrite the buffer with the next read, parse that, and so on. The issue is that the zlib reader appears to return an arbitrary number of bytes each time Read(b) is called; it does not fill the slice. Because of this I don't know what the perfect buffer size would be. I'm concerned that it might break some of the data I wrote into two chunks, making it difficult to parse, because a single uint64 could be split across two reads and therefore not appear in the same buffer read - or perhaps that can never happen, and data always comes back in chunks of the same size as it was originally written? (A sketch of the read loop I mean follows the questions below.)

  1. What is the optimal buffer size, or is there a way to calculate this?
  2. If I have written data into the zlib writer with f.Write(b []byte), is it possible that this same data could be split across two reads when reading the compressed data back (meaning I would have to keep a history during parsing), or will it always come back in the same read?
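
To make question 2 concrete, this is the kind of read loop I mean (a sketch; zr is a zlib reader as above and parseChunk is a placeholder):

    buf := make([]byte, 2048) // some hopefully "perfect" size
    for {
        n, err := zr.Read(buf) // n can be anywhere from 0 to len(buf)
        if n > 0 {
            parseChunk(buf[:n]) // may end mid-value, e.g. halfway through a uint64
        }
        if err != nil {
            break // io.EOF or a real error
        }
    }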


2 Answers

  • douhoujun9304 2014-12-16 17:36
    Accepted

    OK, so I figured this out in the end using my own implementation of a reader.

    Basically the struct looks like this:

    import "io"

    type reader struct {
        at  int           // offset of the first unread byte in buf
        n   int           // number of unread bytes currently in buf
        f   io.ReadCloser // the underlying zlib reader
        buf []byte        // fixed-size reusable buffer
    }
    

    This can be attached to the zlib reader:

    // Open the compressed file for reading ("os" and "compress/zlib" imports assumed)
    fi, err := os.Open(filename)
    if err != nil {
        return nil, err
    }
    defer fi.Close()
    // Attach the zlib reader and allocate the reusable buffer
    r := new(reader)
    r.buf = make([]byte, 2048)
    r.f, err = zlib.NewReader(fi)
    if err != nil {
        return nil, err
    }
    // Note: the deferred Closes mean all parsing must happen within this function
    defer r.f.Close()
    

    Then x bytes at a time can be read straight out of the zlib reader using a method like this:

    mydata := r.readx(10)

    // readx returns the next x bytes of the decompressed stream; x must not exceed len(r.buf).
    func (r *reader) readx(x int) []byte {
        for r.n < x {
            // Shift the unread bytes to the front of the buffer, then refill from the zlib reader.
            copy(r.buf, r.buf[r.at:r.at+r.n])
            r.at = 0
            m, err := r.f.Read(r.buf[r.n:])
            if err != nil {
                panic(err)
            }
            r.n += m
        }
        tmp := make([]byte, x)
        copy(tmp, r.buf[r.at:r.at+x]) // copy out so the result doesn't alias the reusable buffer
        r.at += x
        r.n -= x
        return tmp
    }
    

    Note that I have no need to check for EOF, because my parser should stop itself at the right place.
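
    For example, fixed-width values can then be decoded directly from the returned slice (a sketch assuming big-endian encoding and an encoding/binary import; the field names are illustrative):

    id := binary.BigEndian.Uint64(r.readx(8)) // one 8-byte value, never split across reads
    name := r.readx(16)                       // a 16-byte field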

  • doumei8126 2014-12-16 14:05

    You can wrap your zlib reader in a bufio reader, then implement a specialized reader on top that will rebuild your chunks of data by reading from the bufio reader until a full chunk is read. Be aware that bufio.Read calls Read at most once on the underlying Reader, so you need to call ReadByte in a loop. bufio will however take care of the unpredictable size of data returned by the zlib reader for you.

    If you do not want to implement a specialized reader, you can just go with a bufio reader and read as many bytes as needed with ReadByte() to fill a given data type. The optimal buffer size is at least the size of your largest data structure, up to whatever you can shove into memory.
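
    For example (a sketch; zr is the zlib reader, and a big-endian on-disk layout is an assumption):

    br := bufio.NewReader(zr) // smooths out the zlib reader's erratic read sizes
    var v uint64
    if err := binary.Read(br, binary.BigEndian, &v); err != nil {
        panic(err) // fills all 8 bytes or fails (io.ErrUnexpectedEOF on a short read)
    }

    binary.Read keeps reading until the value is filled, so it already does the ReadByte-style looping for you.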

    If you read directly from the zlib reader, there is no guarantee that your data won't be split between two reads.

    Another, maybe cleaner, solution is to implement a writer for your data, then use io.Copy(your_writer, zlib_reader).
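
    For example (a sketch; chunkParser and the fixed 8-byte record size are illustrative assumptions):

    // chunkParser reassembles records from whatever slice sizes io.Copy delivers.
    type chunkParser struct {
        pending []byte // partial record carried over between Write calls
    }

    func (p *chunkParser) Write(b []byte) (int, error) {
        p.pending = append(p.pending, b...)
        for len(p.pending) >= 8 { // consume every complete 8-byte record
            handleRecord(binary.BigEndian.Uint64(p.pending[:8])) // handleRecord is a placeholder
            p.pending = p.pending[8:]
        }
        return len(b), nil
    }

    // Usage: _, err := io.Copy(&chunkParser{}, zlib_reader)

    io.Copy then drives the whole decompress-and-parse loop with a single reusable internal buffer.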

