How to increase the speed of reading a large file line by line in Go

I'm trying to figure out the fastest way to read a large file line by line and check whether each line contains a string. The file I'm testing on is about 680 MB.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
        // Report read errors, such as a line longer than the scanner's buffer.
        if err := scanner.Err(); err != nil {
            panic(err)
        }
    }

After building the program and timing it on my machine, it runs for over 3 seconds:

    ./speed 3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer:

    buf := make([]byte, 64*1024)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)

it gets a little better:

    ./speed 2.47s user 0.25s system 104% cpu 2.609 total

I know it can get better because other tools manage to do it in under a second without any kind of indexing. What seems to be the bottleneck with this approach?

0.33s user 0.14s system 94% cpu 0.501 total

doucheng5209 2019-03-05 21:43
LAST EDIT

This is a "line-by-line" solution to the problem that takes trivial time. It prints the entire matching line.

    package main

    import (
        "bytes"
        "fmt"
        "os"
    )

    func main() {
        dat, err := os.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        i := bytes.Index(dat, []byte("Iforgotmypassword"))
        if i == -1 {
            return
        }
        // Start of the line: one past the previous newline (0 if there is none).
        start := bytes.LastIndexByte(dat[:i], '\n') + 1
        // End of the line: the next newline (end of the data if there is none).
        end := len(dat)
        if j := bytes.IndexByte(dat[i:], '\n'); j != -1 {
            end = i + j
        }
        fmt.Println(string(dat[start:end]))
    }

    real    0m0.421s
    user    0m0.068s
    sys     0m0.352s

ORIGINAL ANSWER

If you just need to see if the string is in a file, why not use regex?

Note: I kept the data as a byte array instead of converting to string.

    package main

    import (
        "fmt"
        "os"
        "regexp"
    )

    var regex = regexp.MustCompile(`Ilostmypassword`)

    func main() {
        dat, err := os.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        // Match works on the raw bytes, so no string conversion is needed.
        if regex.Match(dat) {
            fmt.Println("Yes")
        }
    }

jumble.txt is 859 MB of jumbled text with newlines included.

Running with time ./code I get:

    real    0m0.405s
    user    0m0.064s
    sys     0m0.340s

To try to answer your comment: I don't think the bottleneck inherently comes from searching line by line. Go uses an efficient algorithm for searching strings/runes.

I think the bottleneck comes from the I/O reads. When the program reads from the file, it is normally not first in the queue for I/O, so it must wait for its turn before it can start comparing. When you read over and over in small chunks, you are forced to wait for your turn on every read.

To give you some math: if your buffer size is 64 * 1024 (65536 bytes) and your file is 1 GB (10^9 bytes), then 1 GB / 65536 bytes is about 15259 reads needed to check the entire file, whereas in my method I read the entire file "at once" and check against that single byte slice.
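The read count above can be checked with a few lines of Go. This is only a sketch of the arithmetic: `readsNeeded` is a hypothetical helper, and it assumes every read fills the whole buffer, which real reads are not guaranteed to do.

```go
package main

import "fmt"

// readsNeeded is a hypothetical helper: it returns how many reads of at most
// bufSize bytes are needed to consume fileSize bytes (ceiling division),
// assuming every read fills the buffer completely.
func readsNeeded(fileSize, bufSize int64) int64 {
	return (fileSize + bufSize - 1) / bufSize
}

func main() {
	const bufSize = 64 * 1024 // 65536 bytes

	fmt.Println(readsNeeded(1000000000, bufSize)) // 1 GB taken as 10^9 bytes: 15259
	fmt.Println(readsNeeded(1<<30, bufSize))      // 1 GiB taken as 2^30 bytes: 16384
}
```

Either way the count is on the order of 15,000 to 16,000 reads, compared with a single logical read when the whole file is loaded at once.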

Another thing I can think of is the sheer number of loop iterations needed to move through the file, and the time needed for each iteration:

Given the following code:

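    dat, err := os.ReadFile("./jumble.txt")
    if err != nil {
        panic(err)
    }
    // Split the data into lines and compare each whole line with the target.
    sdat := bytes.Split(dat, []byte{'\n'})
    needle := []byte("Iforgotmypassword")
    for _, l := range sdat {
        if bytes.Equal(needle, l) {
            fmt.Println("Yes")
        }
    }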

I calculated that each loop iteration takes 32 nanoseconds on average. The string Iforgotmypassword was on line 100000000 in my file, so the execution time for this loop was roughly 32 nanoseconds * 100000000 ~= 3.2 seconds.
This answer was selected as the best answer by the asker.
