dongshuang0011
2019-03-05 20:06
Views: 305
Accepted

How to speed up reading a large file line by line in Go

I'm trying to figure out the fastest way of reading a large file line by line and checking whether each line contains a string. The file I'm testing on is about 680 MB.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)

        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
    }

After building the program and timing it on my machine, it takes over 3 seconds:

    ./speed 3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer:

    buf := make([]byte, 64*1024)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)

it gets a little better:

    ./speed 2.47s user 0.25s system 104% cpu 2.609 total

I know it can be faster, because other tools manage to do it in under a second without any kind of indexing. What is the bottleneck with this approach?

    0.33s user 0.14s system 94% cpu 0.501 total


3 answers

  • doucheng5209 2019-03-05 21:43
    Accepted answer

    LAST EDIT

    This is a "line-by-line" solution to the problem that takes trivial time. It prints the entire line containing the first match.

    package main
    
    import (
        "bytes"
        "fmt"
        "io/ioutil"
    )
    
    func main() {
        dat, err := ioutil.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        i := bytes.Index(dat, []byte("Iforgotmypassword"))
        if i != -1 {
            // Walk back to the first byte of the line containing the match.
            x := i
            for x > 0 && dat[x-1] != '\n' {
                x--
            }
            // Walk forward to the newline (or end of data) that ends the line.
            y := i
            for y < len(dat) && dat[y] != '\n' {
                y++
            }
            fmt.Println(string(dat[x:y]))
        }
    }
    
    real    0m0.421s
    user    0m0.068s
    sys     0m0.352s
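The program above prints only the line containing the first match. If every matching line is wanted, one possible extension is sketched below. It is not part of the original answer: `matchingLines` is a hypothetical helper name, the sample data in `main` is made up, and the search string is assumed not to contain a newline.

```go
package main

import (
	"bytes"
	"fmt"
)

// matchingLines returns every line of data that contains needle.
// Lines are returned without their trailing newline.
// Assumption: needle is non-empty and contains no newline.
func matchingLines(data, needle []byte) [][]byte {
	var lines [][]byte
	if len(needle) == 0 {
		return lines
	}
	pos := 0
	for pos < len(data) {
		i := bytes.Index(data[pos:], needle)
		if i == -1 {
			break
		}
		i += pos
		// Start of the line: one past the previous newline (or 0).
		start := bytes.LastIndexByte(data[:i], '\n') + 1
		// End of the line: the next newline (or the end of data).
		end := len(data)
		if j := bytes.IndexByte(data[i:], '\n'); j != -1 {
			end = i + j
		}
		lines = append(lines, data[start:end])
		// Resume after this line so it is reported only once.
		pos = end + 1
	}
	return lines
}

func main() {
	// Made-up sample data standing in for the file contents.
	data := []byte("abc\nxx Iforgotmypassword yy\nfoo\nIforgotmypassword\n")
	for _, l := range matchingLines(data, []byte("Iforgotmypassword")) {
		fmt.Printf("%s\n", l)
	}
}
```

In a real run, `data` would be the result of reading the whole file, as in the answer's program. The returned slices alias the input buffer, so no per-line copies are made.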
    

    ORIGINAL ANSWER

    If you just need to see if the string is in a file, why not use regex?

    Note: I kept the data as a byte slice instead of converting it to a string.

    package main
    
    import (
        "fmt"
        "io/ioutil"
        "regexp"
    )
    
    var regex = regexp.MustCompile(`Ilostmypassword`)
    
    func main() {
        dat, _ := ioutil.ReadFile("./jumble.txt")
        if regex.Match(dat) {
            fmt.Println("Yes")
        }
    }
    

    jumble.txt is 859 MB of jumbled text with newlines included.

    Running with time ./code I get:

    real    0m0.405s
    user    0m0.064s
    sys     0m0.340s
    

    To try and answer your comment: I don't think the bottleneck inherently comes from searching line by line. Go uses an efficient algorithm for searching strings and runes.

    I think the bottleneck comes from the I/O reads. When the program reads from the file, it is normally not first in the queue of pending reads, so it must wait for its turn before it can start comparing. When you read over and over, you pay that wait on every read.

    To give you some math: if your buffer size is 64 * 1024 (65536) bytes and your file is 1 GB, then 1 GB / 65536 bytes is roughly 15,259 reads needed to check the entire file, whereas in my method I read the entire file "at once" and check against that single byte slice.
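As a quick check of that arithmetic, here is a minimal sketch. `readsNeeded` is a hypothetical helper name, and it assumes every read fills the buffer completely, which real reads do not guarantee.

```go
package main

import "fmt"

// readsNeeded returns how many buffer-sized reads it takes to cover
// fileSize bytes, rounding up for a final partial read.
// Assumption: every read fills the whole buffer.
func readsNeeded(fileSize, bufSize int64) int64 {
	return (fileSize + bufSize - 1) / bufSize
}

func main() {
	const bufSize = 64 * 1024 // 65536 bytes

	fmt.Println(readsNeeded(1_000_000_000, bufSize)) // 1 GB (decimal): prints 15259
	fmt.Println(readsNeeded(1<<30, bufSize))         // 1 GiB (binary): prints 16384
}
```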

    Another thing I can think of is the sheer number of loop iterations needed to move through the file, and the time each iteration takes:

    Given the following code:

    dat, _ := ioutil.ReadFile("./jumble.txt")
    sdat := bytes.Split(dat, []byte{'\n'})
    for _, l := range sdat {
        if bytes.Equal([]byte("Iforgotmypassword"), l) {
            fmt.Println("Yes")
        }
    }
    

    I calculated that each iteration takes 32 nanoseconds on average. The string Iforgotmypassword was on line 100000000 of my file, so the execution time for this loop was roughly 32 nanoseconds * 100000000 ~= 3.2 seconds.

  • douna4762 2019-03-05 20:28

    You might try using goroutines to process multiple lines in parallel:

    // numWorkers is the number of worker goroutines; scanner is the
    // bufio.Scanner from the question.
    lines := make(chan string, numWorkers*2) // buffered so the reader is rarely blocked
    
    // Reader: send each line to the workers, then close the channel.
    go func(scanner *bufio.Scanner, out chan<- string) {
      for scanner.Scan() {
        out <- scanner.Text()
      }
      close(out)
    }(scanner, lines)
    
    var wg sync.WaitGroup
    wg.Add(numWorkers)
    
    for i := 0; i < numWorkers; i++ {
      go func(in <-chan string) {
        defer wg.Done()
        for text := range in {
          if strings.Contains(text, "Iforgotmypassword") {
            fmt.Println(text) // print the line this worker received
          }
        }
      }(lines)
    }
    
    wg.Wait()
    

    I'm not sure how much this will really speed things up, as it depends on what kind of hardware you have available. It sounds like you're looking for a more than 5x speed improvement, so you might see a difference if you're running on something that can support 8 parallel worker threads. Feel free to use lots of worker goroutines. Good luck.

  • duanju8431 2019-03-05 21:49

    Using my own 700 MB test file, your original program took just over 7 seconds.

    With grep it took 0.49 seconds.

    With this program (which doesn't print the line, it just says yes) it took 0.082 seconds:

    package main
    
    import (
        "bytes"
        "fmt"
        "io/ioutil"
        "os"
    )
    
    func check(e error) {
        if e != nil {
            panic(e)
        }
    }
    func main() {
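        // The string to search for is taken from the first command-line argument.
        if len(os.Args) < 2 {
            fmt.Println("pass the string to search for as the first argument")
            return
        }
        find := []byte(os.Args[1])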
        dat, err := ioutil.ReadFile("crackstation-human-only.txt")
        check(err)
        if bytes.Contains(dat, find) {
            fmt.Print("yes")
        }
    }
    