duangou6446 2017-10-26 06:39

Go: read a block of lines from a zip file [closed]

I need to read a block of n lines from a zip file as quickly as possible.

I'm a beginner in Go. For bash lovers, I want to do the same as the following (to get the block of lines 199500 through 200000):

time query=$(zcat fake_contacts_200k.zip | sed '199500,200000!d')

real    0m0.106s
user    0m0.119s
sys     0m0.013s

Any idea is welcome.


1 answer

  • duanjucong3124 2017-10-26 08:25
    1. Import archive/zip.

    2. Open and read the archive file as shown in the example right there in the docs.

      • Note that in order to mimic the behaviour of zcat you have to first check the length of the File field of the zip.ReadCloser instance returned by a call to zip.OpenReader, and fail if it is not equal to 1, that is, if there are no files in the archive or there are two or more files in it¹.

      • Note that you have to check whether the error value returned by a call to zip.OpenReader is zip.ErrFormat, and if it is, you have to:

        • Skip closing anything: when zip.OpenReader fails it returns a nil *zip.ReadCloser and has already closed the underlying file.
        • Try to reinterpret the file as being gzip-formatted (step 4).
    3. Take the first (and sole) File member and call Open on it.

      You can then read the file's contents from the returned io.ReadCloser.

      After reading, you need to call Close() on that instance and then close the zip file as well. That's all. ∎

    4. If step (2) failed because the file did not have the zip format, you'd test whether it's gzip-formatted.

      In order to do this, you do basically the same steps using the compress/gzip package.

      Note that contrary to the zip format, gzip does not provide file archival — it's merely a compressor, so there's no meta information on any files in the gzip stream, just the compressed data. (This fact is underlined by the difference in the names of the packages.)

      If an attempt to open the same file as a gzip stream returns the gzip.ErrHeader error, you bail out; otherwise you read the data, after which you close the reader. That's all. ∎

    To process just the specific lines from the decompressed file, you'd need to

    1. Skip the lines before the first one to process.
    2. Process the lines up to and including the last one to process.
    3. Stop processing.

    To interpret the data read from an io.Reader or io.ReadCloser, it's best to use bufio.Scanner — see the "Example (Lines)" there.

    P.S.

    Please read this essay thoroughly to try to make your next question better than this one.


    ¹ You might as well read all the files and interpret their contents as a contiguous stream — that would deviate from the behaviour of zcat but that might be better. It really depends on your data.
