dongshuang0011 2019-03-05 20:06
浏览 443
已采纳

如何提高Go中逐行读取大文件的速度

I'm trying to figure out the most fastest way of reading a large file line by line and checking if the line contains a string. The file I'm testing on is about 680mb large

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")

        scanner := bufio.NewScanner(f)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
    }

After building the program and timing it on my machine it runs over 3 seconds ./speed 3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer

buf := make([]byte, 64*1024)
scanner.Buffer(buf, bufio.MaxScanTokenSize)

It gets a little better ./speed 2.47s user 0.25s system 104% cpu 2.609 total

I know it can get better because other tools mange to do it under a second without any kind of indexing. What seems to be the bottleneck with this approach?

0.33s user 0.14s system 94% cpu 0.501 total

  • 写回答

3条回答 默认 最新

  • doucheng5209 2019-03-05 21:43
    关注

    LAST EDIT

    This is a "line-by-line" solution to the problem that takes trivial time, it prints the entire matching line.

    package main
    
    import (
        "bytes"
        "fmt"
        "io/ioutil"
    )
    
    func main() {
        dat, _ := ioutil.ReadFile("./jumble.txt")
        i := bytes.Index(dat, []byte("Iforgotmypassword"))
        if i != -1 {
            var x int
            var y int
            for x = i; x > 0; x-- {
                if dat[x] == byte('
    ') {
                    break
                }
            }
            for y = i; y < len(dat); y++ {
                if dat[y] == byte('
    ') {
                    break
                }
            }
            fmt.Println(string(dat[x : y+1]))
        }
    }
    
    real    0m0.421s
    user    0m0.068s
    sys     0m0.352s
    

    ORIGINAL ANSWER

    If you just need to see if the string is in a file, why not use regex?

    Note: I kept the data as a byte array instead of converting to string.

    package main
    
    import (
        "fmt"
        "io/ioutil"
        "regexp"
    )
    
    var regex = regexp.MustCompile(`Ilostmypassword`)
    
    func main() {
        dat, _ := ioutil.ReadFile("./jumble.txt")
        if regex.Match(dat) {
            fmt.Println("Yes")
        }
    }
    

    jumble.txt is a 859 MB of jumbled text with newlines included.

    Running with time ./code I get:

    real    0m0.405s
    user    0m0.064s
    sys     0m0.340s
    

    To try and answer your comment, I don't think the bottleneck is inherently coming from searching line by line, Golang uses an efficient algorithm for searching strings/runes.

    I think the bottleneck comes from the IO reads, when the program reads from the file, it is normally not first in line in the queue of reading, therefore, the program must wait until it can read in order to start actually comparing. Thus, when you are reading in over and over, you are being forced to wait for your turn in IO.

    To give you some math, if your buffer size is 64 * 1024 (or 65535 bytes), and your file is 1 GB. Dividing 1 GB / 65535 bytes is 15249 reads needed to check the entire file. Where as in my method, I read the entire file "at once" and check against that constructed array.

    Another thing I can think of is just the utter amount of loops needed to move through the file and the time needed for each loop:

    Given the following code:

    dat, _ := ioutil.ReadFile("./jumble.txt")
    sdat := bytes.Split(dat, []byte{'
    '})
    for _, l := range sdat {
        if bytes.Equal([]byte("Iforgotmypassword"), l) {
            fmt.Println("Yes")
        }
    }
    

    I calculated that each loop takes on average 32 nanoseconds, the string Iforgotmypassword was on line 100000000 in my file, thus the execution time for this loop was roughly 32 nanoseconds * 100000000 ~= 3.2 seconds.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥50 如何用脚本实现输入法的热键设置
  • ¥20 我想使用一些网络协议或者部分协议也行,主要想实现类似于traceroute的一定步长内的路由拓扑功能
  • ¥30 深度学习,前后端连接
  • ¥15 孟德尔随机化结果不一致
  • ¥15 apm2.8飞控罗盘bad health,加速度计校准失败
  • ¥15 求解O-S方程的特征值问题给出边界层布拉休斯平行流的中性曲线
  • ¥15 谁有desed数据集呀
  • ¥20 手写数字识别运行c仿真时,程序报错错误代码sim211-100
  • ¥15 关于#hadoop#的问题
  • ¥15 (标签-Python|关键词-socket)