dongshuang0011 2019-03-05 20:06

How to speed up reading a large file line by line in Go

I'm trying to figure out the fastest way to read a large file line by line and check whether each line contains a string. The file I'm testing on is about 680 MB.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
        // Report read errors, including lines longer than the scanner's buffer.
        if err := scanner.Err(); err != nil {
            panic(err)
        }
    }

After building the program and timing it on my machine, it runs for over 3 seconds:

    ./speed  3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer:

    buf := make([]byte, 64*1024)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)

it gets a little better:

    ./speed  2.47s user 0.25s system 104% cpu 2.609 total

I know it can get better because other tools manage to do it in under a second without any kind of indexing. What seems to be the bottleneck with this approach?

0.33s user 0.14s system 94% cpu 0.501 total


  • doucheng5209 2019-03-05 21:43

    LAST EDIT

    This is a "line-by-line" solution to the problem that takes trivial time; it prints the entire matching line.

    package main
    
    import (
        "bytes"
        "fmt"
        "io/ioutil"
    )
    
    func main() {
        dat, err := ioutil.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        i := bytes.Index(dat, []byte("Iforgotmypassword"))
        if i != -1 {
            // Start of the line: one past the previous newline (0 if there is none).
            x := bytes.LastIndexByte(dat[:i], '\n') + 1
            // End of the line: the next newline, or the end of the data.
            y := bytes.IndexByte(dat[i:], '\n')
            if y == -1 {
                y = len(dat)
            } else {
                y += i
            }
            fmt.Println(string(dat[x:y]))
        }
    }
    
    real    0m0.421s
    user    0m0.068s
    sys     0m0.352s
    

    ORIGINAL ANSWER

    If you just need to see if the string is in a file, why not use regex?

    Note: I kept the data as a byte slice instead of converting it to a string.

    package main
    
    import (
        "fmt"
        "io/ioutil"
        "regexp"
    )
    
    var regex = regexp.MustCompile(`Ilostmypassword`)
    
    func main() {
        dat, _ := ioutil.ReadFile("./jumble.txt")
        if regex.Match(dat) {
            fmt.Println("Yes")
        }
    }
    

    jumble.txt is 859 MB of jumbled text with newlines included.

    Running with time ./code I get:

    real    0m0.405s
    user    0m0.064s
    sys     0m0.340s
    

    To try to answer your comment: I don't think the bottleneck inherently comes from searching line by line. Go's standard library uses an efficient algorithm for searching strings.

    I think the bottleneck comes from the I/O reads. When the program reads from the file, its request is normally not first in the I/O queue, so the program must wait for its turn before it can start comparing. When you read over and over, you pay that wait on every read.

    To give you some math: if your buffer size is 64 * 1024 (65,536 bytes) and your file is 1 GB, then 1 GB / 65,536 bytes is roughly 15,259 reads needed to check the entire file. In my method, by contrast, I read the entire file "at once" and check against the resulting byte slice.
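    The read-count arithmetic can be checked with a few lines of Go. This is a minimal sketch that assumes every read fills the whole buffer, which the operating system does not guarantee; `readsNeeded` is a hypothetical helper name.

```go
package main

import "fmt"

// readsNeeded returns how many reads of bufSize bytes are needed to
// consume fileSize bytes, rounding up for a final partial read.
// It assumes every read fills the buffer completely.
func readsNeeded(fileSize, bufSize int64) int64 {
	if bufSize <= 0 {
		return 0
	}
	return (fileSize + bufSize - 1) / bufSize
}

func main() {
	const bufSize = 64 * 1024 // 65,536 bytes

	fmt.Println(readsNeeded(1000000000, bufSize)) // 1 GB as 10^9 bytes: prints 15259
	fmt.Println(readsNeeded(1<<30, bufSize))      // 1 GiB as 2^30 bytes: prints 16384
}
```

    Either way, the count is in the tens of thousands of reads for a file of this size.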

    Another thing I can think of is the sheer number of loop iterations needed to move through the file and the time each iteration takes:

    Given the following code:

    dat, _ := ioutil.ReadFile("./jumble.txt")
    sdat := bytes.Split(dat, []byte{'\n'})
    for _, l := range sdat {
        if bytes.Equal([]byte("Iforgotmypassword"), l) {
            fmt.Println("Yes")
        }
    }
    

    I calculated that each iteration takes 32 nanoseconds on average. The string Iforgotmypassword was on line 100,000,000 of my file, so the execution time for this loop was roughly 32 ns * 100,000,000 ≈ 3.2 seconds.
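    A per-line figure like this can be estimated by timing the loop and dividing by the number of lines. The sketch below is one way to do that, not the exact measurement used above: `countMatches` is a hypothetical helper, the input is synthetic, and the measured time includes the cost of `bytes.Split` itself.

```go
package main

import (
	"bytes"
	"fmt"
	"time"
)

// countMatches splits data on newlines and counts the lines that are
// exactly equal to needle. It returns the total number of lines and
// the number of matches.
func countMatches(data, needle []byte) (lines, matches int) {
	for _, l := range bytes.Split(data, []byte{'\n'}) {
		lines++
		if bytes.Equal(needle, l) {
			matches++
		}
	}
	return lines, matches
}

func main() {
	needle := []byte("Iforgotmypassword")

	// Synthetic input (an assumption): one million filler lines with
	// the needle on the last line.
	data := append(bytes.Repeat([]byte("filler\n"), 1000000), needle...)

	start := time.Now()
	lines, matches := countMatches(data, needle)
	elapsed := time.Since(start)

	// Average cost per line; the value depends on the machine.
	fmt.Printf("%d lines, %d matches, %v per line\n",
		lines, matches, elapsed/time.Duration(lines))
}
```

    Multiplying the reported per-line cost by the line number of the match gives a rough estimate of the total loop time.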

    This answer was accepted by the asker as the best answer.