dql7588
2012-10-06 00:16

Optimizing a Go file-reading program

Accepted

I'm trying to process a log file, each line of which looks something like this:

flow_stats: 0.30062869162666672 gid 0 fid 1 pkts 5.0 fldur 0.30001386666666674 avgfldur 0.30001386666666674 actfl 3142 avgpps 16.665896331902879 finfl 1

I'm interested in the pkts field and the fldur field. I've got a Python script that reads a million-line log file, builds a list of durations for each packet count, sorts those lists, and computes the medians, all in about 3 seconds.

I'm playing around with the Go programming language and thought I'd rewrite this, in the hope that it would run faster.

So far, I've been disappointed. Just reading the file in to the data structure takes about 5.5 seconds. So I'm wondering if some of you wonderful people can help me make this go (hehe) faster.

Here's my loop:

data := make(map[int][]float32)
infile, err := os.Open("tmp/flow.tr")
defer infile.Close()
if err != nil {
  panic(err)
}
reader := bufio.NewReader(infile)

line, err := reader.ReadString('\n')
for {
  if len(line) == 0 {
    break
  }
  if err != nil && err != io.EOF {
    panic(err)
  }
  split_line := strings.Fields(line)
  num_packets, err := strconv.ParseFloat(split_line[7], 32)
  duration, err := strconv.ParseFloat(split_line[9], 32)
  data[int(num_packets)] = append(data[int(num_packets)], float32(duration))

  line, err = reader.ReadString('\n')
}

Note that I do actually check the errs in the loop -- I've omitted that for brevity. google-pprof indicates that the majority of the time is being spent in strings.Fields, via strings.FieldsFunc, unicode.IsSpace, and runtime.stringiter2.
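For reference, a CPU profile like that can be collected with the standard runtime/pprof package; here is a minimal sketch (cpu.prof is just a placeholder output name):

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    // Write the CPU profile to a file that google-pprof can analyze afterwards.
    f, err := os.Create("cpu.prof")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err)
    }
    defer pprof.StopCPUProfile()

    // ... the parsing loop from above would run here ...
}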

How can I make this run faster?


1 answer

  • duangou6446 9 years ago

    Replacing

    split_line := strings.Fields(line)

    with

    split_line := strings.SplitN(line, " ", 11)

    yielded a ~4x speed improvement on a randomly generated 1M-line file that mimicked the format you provided above:

    strings.Fields version: Completed in 4.232525975s

    strings.SplitN version: Completed in 1.111450755s

    Some of the efficiency comes from being able to avoid parsing and splitting the rest of the input line once the duration field has been split out, but most of it comes from the simpler splitting logic in SplitN. Even splitting all of the fields doesn't take much longer than stopping after the duration. Using:

    split_line := strings.SplitN(line, " ", -1)

    Completed in 1.554971313s

    SplitN and Fields are not the same. Fields assumes tokens are bounded by one or more whitespace characters, whereas SplitN treats tokens as anything bounded by the separator string. If your input had multiple spaces between tokens, split_line would contain an empty token for each extra space.
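    A quick standalone sketch of that difference (the input string is made up for illustration):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        s := "a  b c" // note the two spaces between "a" and "b"

        // Fields collapses runs of whitespace, so no empty tokens appear.
        fmt.Printf("%q\n", strings.Fields(s)) // ["a" "b" "c"]

        // SplitN with a single-space separator keeps an empty token for the extra space.
        fmt.Printf("%q\n", strings.SplitN(s, " ", -1)) // ["a" "" "b" "c"]
    }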

    Sorting and calculating the median does not add much time. I changed the code to use float64 rather than float32, simply because sort.Float64s works directly on a []float64. Here's the complete program:

    package main
    
    import (
        "bufio"
        "fmt"
        "os"
        "sort"
        "strconv"
        "strings"
        "time"
    )
    
    // sortKeys returns the keys of a map[int][]float64 in ascending order.
    func sortKeys(items map[int][]float64) []int {
        keys := make([]int, len(items))
        i := 0
        for k := range items {
            keys[i] = k
            i++
        }
        sort.Ints(keys)
        return keys
    }
    
    // median sorts d in place and returns its median value.
    func median(d []float64) (m float64) {
        sort.Float64s(d)
        length := len(d)
        if length%2 == 1 {
            m = d[length/2]
        } else {
            m = (d[length/2] + d[length/2-1]) / 2
        }
        return m
    }
    
    func main() {
        data := make(map[int][]float64)
        infile, err := os.Open("sample.log")
        defer infile.Close()
        if err != nil {
            panic(err)
        }
        reader := bufio.NewReaderSize(infile, 256*1024)
    
        s := time.Now()
        for {
            line, err := reader.ReadString('\n')
            if len(line) == 0 {
                break
            }
            if err != nil && err != io.EOF {
                panic(err)
            }
            split_line := strings.SplitN(line, " ", 11)
            num_packets, err := strconv.ParseFloat(split_line[7], 32)
            if err != nil {
                panic(err)
            }
            duration, err := strconv.ParseFloat(split_line[9], 32)
            if err != nil {
                panic(err)
            }
            pkts := int(num_packets)
            data[pkts] = append(data[pkts], duration)
        }
    
        for _, k := range sortKeys(data) {
            fmt.Printf("pkts: %d, median: %f
    ", k, median(data[k]))
        }
        fmt.Println("
    Completed in ", time.Since(s))
    }
    

    And the output:

    pkts: 0, median: 0.498146
    pkts: 1, median: 0.511023
    pkts: 2, median: 0.501408
    ...
    pkts: 99, median: 0.501517
    pkts: 100, median: 0.491499
    
    Completed in  1.497052072s
    
