doucai4274
2019-06-27 21:02

Finding the byte offset of a pattern in Golang


We can find the byte offset of a pattern in a file with "grep -ob pattern filename". However, grep is not UTF-8 safe. How do I find the byte offset of a pattern in Go? The file is a process log, which can be terabytes in size.

This is what I want to get in Go:

$ cat fname
hello world
findme
hello 世界
findme again

...

$ grep -ob findme fname

12:findme
32:findme

1 answer

  • dongmeiwei0226, 2 years ago

    FindAllStringIndex(s string, n int), a method of *regexp.Regexp, returns one [start end] pair of byte offsets into s for each successive match of the expression (end is exclusive); passing n = -1 returns all matches:

    package main

    import (
        "fmt"
        "os"
        "regexp"
    )

    func main() {
        fname := "C:\\Users\\UserName\\go\\src\\so56798431\\fname"
        // Read the whole file into memory; only suitable for small files.
        b, err := os.ReadFile(fname)
        if err != nil {
            panic(err)
        }

        re, err := regexp.Compile("findme")
        if err != nil {
            panic(err)
        }
        // Each result is a [start end] pair of byte offsets into the input.
        fmt.Println(re.FindAllStringIndex(string(b), -1))
    }
    

    Output:

    [[12 18] [32 38]]

    Note: I ran this on Microsoft Windows but saved the file in UNIX format (linefeed line endings); if the input file is saved in Windows format (carriage return and linefeed), the byte offsets increase to 13 and 35, respectively.

    UPDATE: for large files, read the file line by line with bufio.Scanner instead of loading it all into memory; for example:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "regexp"
    )

    func main() {
        f, err := os.Open("C:\\Users\\UserName\\go\\src\\so56798431\\fname")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        re, err := regexp.Compile("findme")
        if err != nil {
            log.Fatal(err)
        }

        // By default a Scanner rejects lines longer than bufio.MaxScanTokenSize (64 KiB).
        scanner := bufio.NewScanner(f)
        // Offset of the current line; int64 so TB-sized files fit on any platform.
        var bytesRead int64
        for scanner.Scan() {
            b := scanner.Bytes()
            for _, result := range re.FindAllIndex(b, -1) {
                fmt.Println(bytesRead + int64(result[0]))
            }
            // Account for the UNIX EOL marker. The Scanner strips "\n" and "\r\n"
            // alike, so for files with Windows line endings this must be +2.
            bytesRead += int64(len(b)) + 1
        }

        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }
    

    Output:

    12

    32
