douliaodan2738 2014-04-05 00:14
浏览 521
已采纳

在go中读取文本文件中的随机行

I am using encoding/csv to read and parse a very large .csv file.
I need to randomly select lines and pass them through some test.
My current solution is to read the whole file like

reader := csv.NewReader(file)
lines, err := reader.ReadAll()

then randomly select lines from lines
The obvious problem is it takes a long time to read the whole thing and I need lots of memory.

Question:
my question is, encoding/csv gives me an io/reader is there a way to use that to read random lines instead of loading the whole thing at once?
This is more of a curiosity to learn more about io/reader than a practical question, since it is very likely that in the end it is more efficient to read it once and access it in memory, that to keep seeking random lines off on the disk.

  • 写回答

5条回答 默认 最新

  • duanbei7005 2014-04-05 04:36
    关注

    Apokalyptik's answer is the closest to what you want. Readers are streamers so you can't just hop to a random place (per-se).

    Naively choosing a probability against which you keep any given line as you read it in can lead to problems: you may get to the end of the file without holding enough lines of input, or you may be too quick to hold lines and not get a good sample. Either is much more likely than guessing correctly, since you don't know beforehand how many lines are in the file (unless you first iterate it once to count them).

    What you really need is reservoir sampling.

    Basically, read the file line-by-line. Each line, you choose whether to hold it like so: The first line you read, you have a 1/1 chance of holding it. After you read the second line, you have 1/2 chance of replacing what you're holding with this one. After the third line, you have a 1/2 * 2/3 = 1/3 chance of holding onto that one instead. Thus you have a 1/N chance of holding onto any given line, where N is the number of lines you've read in. Here's a more detailed look at the algorithm (don't try to implement it just from what I've told you in this paragraph alone).

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(4条)

报告相同问题?

悬赏问题

  • ¥15 使用C#,asp.net读取Excel文件并保存到Oracle数据库
  • ¥15 C# datagridview 单元格显示进度及值
  • ¥15 thinkphp6配合social login单点登录问题
  • ¥15 HFSS 中的 H 场图与 MATLAB 中绘制的 B1 场 部分对应不上
  • ¥15 如何在scanpy上做差异基因和通路富集?
  • ¥20 关于#硬件工程#的问题,请各位专家解答!
  • ¥15 关于#matlab#的问题:期望的系统闭环传递函数为G(s)=wn^2/s^2+2¢wn+wn^2阻尼系数¢=0.707,使系统具有较小的超调量
  • ¥15 FLUENT如何实现在堆积颗粒的上表面加载高斯热源
  • ¥30 虚心请教几个问题,小生先有礼了
  • ¥30 截图中的mathematics程序转换成matlab