dongshuang0011 2019-03-05 20:06
Accepted

How to speed up reading a large file line by line in Go

I'm trying to figure out the fastest way to read a large file line by line and check whether each line contains a string. The file I'm testing on is about 680 MB.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)

        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
    }

After building the program and timing it on my machine, it runs for over 3 seconds:

    ./speed 3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer:

    buf := make([]byte, 64*1024)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)

It gets a little better:

    ./speed 2.47s user 0.25s system 104% cpu 2.609 total

I know it can get better, because other tools manage to do it in under a second without any kind of indexing:

    0.33s user 0.14s system 94% cpu 0.501 total

What seems to be the bottleneck with this approach?


  • doucheng5209 2019-03-05 21:43

    LAST EDIT

    This is a "line-by-line" solution to the problem that takes trivial time. It prints the entire line containing the first match.

    package main
    
    import (
        "bytes"
        "fmt"
        "io/ioutil"
    )
    
    func main() {
        dat, err := ioutil.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        i := bytes.Index(dat, []byte("Iforgotmypassword"))
        if i != -1 {
            // Walk back to the first byte of the line containing the match.
            x := i
            for x > 0 && dat[x-1] != '\n' {
                x--
            }
            // Walk forward to the newline (or end of data) that ends the line.
            y := i
            for y < len(dat) && dat[y] != '\n' {
                y++
            }
            // Print the line without its surrounding newlines.
            fmt.Println(string(dat[x:y]))
        }
    }
    
    real    0m0.421s
    user    0m0.068s
    sys     0m0.352s
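The program above reports only the first match. As a rough sketch of how the same idea could report every matching line, the version below jumps from match to match with `bytes.Index` and expands each hit to its line boundaries. The helper name `matchingLines` and the in-memory sample data are made up for illustration, and the sketch assumes the search string itself contains no newline.

```go
package main

import (
	"bytes"
	"fmt"
)

// matchingLines (hypothetical helper) returns every line of data that
// contains needle. It jumps between matches instead of visiting each line.
// Assumption: needle does not contain a newline.
func matchingLines(data, needle []byte) [][]byte {
	var lines [][]byte
	if len(needle) == 0 {
		return lines
	}
	for pos := 0; pos < len(data); {
		i := bytes.Index(data[pos:], needle)
		if i < 0 {
			break
		}
		i += pos
		// Start of line: one past the previous newline, or 0 if there is none.
		start := bytes.LastIndexByte(data[:i], '\n') + 1
		// End of line: the next newline after the match, or the end of data.
		end := len(data)
		if j := bytes.IndexByte(data[i:], '\n'); j >= 0 {
			end = i + j
		}
		lines = append(lines, data[start:end])
		pos = end + 1 // resume after this line so it is reported once
	}
	return lines
}

func main() {
	// Stand-in for the file contents read with ioutil.ReadFile.
	data := []byte("alpha\nxxIforgotmypasswordyy\nbeta\nIforgotmypassword\n")
	for _, l := range matchingLines(data, []byte("Iforgotmypassword")) {
		fmt.Println(string(l))
	}
}
```

The returned slices alias the input buffer, so nothing is copied until a line is printed.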
    

    ORIGINAL ANSWER

    If you just need to see if the string is in a file, why not use regex?

    Note: I kept the data as a byte slice instead of converting it to a string.

    package main
    
    import (
        "fmt"
        "io/ioutil"
        "regexp"
    )
    
    var regex = regexp.MustCompile(`Ilostmypassword`)
    
    func main() {
        dat, err := ioutil.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        if regex.Match(dat) {
            fmt.Println("Yes")
        }
    }
    

    jumble.txt is 859 MB of jumbled text with newlines included.

    Running with time ./code I get:

    real    0m0.405s
    user    0m0.064s
    sys     0m0.340s
    

    To try to answer your comment: I don't think the bottleneck inherently comes from searching line by line. Go uses an efficient algorithm for searching strings and byte slices.

    I think the bottleneck comes from the I/O reads. When the program reads from the file, it is normally not first in line in the I/O queue, so it must wait for its turn before it can start comparing. When you read over and over in small chunks, you are forced to wait for your turn on every read.

    To give you some math: if your buffer size is 64 * 1024 (65,536 bytes) and your file is 1 GB, then 1 GB / 65,536 bytes is roughly 15,259 reads needed to check the entire file. In my method, by contrast, I read the entire file "at once" and check against that one slice.
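The read count above can be reproduced with a few lines of Go. This is an idealized estimate that assumes every read fills the whole buffer; `readsNeeded` is a made-up helper name.

```go
package main

import "fmt"

// readsNeeded (hypothetical helper) returns how many reads of bufSize bytes
// are needed to cover fileSize bytes, rounding up for a final partial read.
func readsNeeded(fileSize, bufSize int64) int64 {
	return (fileSize + bufSize - 1) / bufSize
}

func main() {
	const bufSize = 64 * 1024 // 65,536 bytes

	fmt.Println(readsNeeded(1000000000, bufSize)) // 1 GB, decimal: 15259
	fmt.Println(readsNeeded(1<<30, bufSize))      // 1 GiB, binary: 16384
}
```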

    Another thing I can think of is the sheer number of loop iterations needed to move through the file, and the time each iteration takes:

    Given the following code:

    dat, _ := ioutil.ReadFile("./jumble.txt")
    sdat := bytes.Split(dat, []byte{'\n'})
    for _, l := range sdat {
        if bytes.Equal([]byte("Iforgotmypassword"), l) {
            fmt.Println("Yes")
        }
    }
    

    I calculated that each iteration takes 32 nanoseconds on average. The string Iforgotmypassword was on line 100,000,000 of my file, so the execution time for this loop was roughly 32 nanoseconds * 100,000,000 ~= 3.2 seconds.
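A per-iteration figure like this can be measured with a sketch along the following lines. The helper name `countEqualLines` and the synthetic data are assumptions for illustration. Only the loop is timed, not the `bytes.Split` call, and the result will vary by machine.

```go
package main

import (
	"bytes"
	"fmt"
	"time"
)

// countEqualLines (hypothetical helper) splits data on newlines, counts the
// lines equal to needle, and returns the average time spent per line in the
// comparison loop. The split itself is not included in the timing.
func countEqualLines(data, needle []byte) (int, time.Duration) {
	lines := bytes.Split(data, []byte{'\n'}) // always at least one element
	start := time.Now()
	n := 0
	for _, l := range lines {
		if bytes.Equal(l, needle) {
			n++
		}
	}
	elapsed := time.Since(start)
	return n, elapsed / time.Duration(len(lines))
}

func main() {
	// Synthetic stand-in for the file: many filler lines, then the target.
	data := bytes.Repeat([]byte("filler line\n"), 1000000)
	data = append(data, []byte("Iforgotmypassword\n")...)

	n, perLine := countEqualLines(data, []byte("Iforgotmypassword"))
	fmt.Printf("matches: %d, average per line: %v\n", n, perLine)
}
```

Multiplying the reported average by the number of lines in the real file gives an estimate comparable to the one above.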
