dpt8910 2013-10-09 21:52
Views: 46
Accepted

What can cause huge overhead in goroutines?

For an assignment we are using Go, and one of the tasks is to parse a UniProt database file line by line to collect UniProt records.

I prefer not to share too much code, but I have a working snippet that parses such a file (2.5 GB) correctly in 48 s (measured using the time package). It parses the file iteratively, appending lines to a record until a record-end signal is reached (a full record), at which point metadata for the record is created. Then the record string is cleared and a new record is collected line by line. I then thought I would try goroutines.

I had received some tips on Stack Overflow before, so I simply added to the original code a function that handles everything concerning the metadata creation.

So, the code is doing

  1. create an empty record,
  2. iterate over the file and append lines to the record,
  3. if a record-stop signal is found (we now have a full record), hand it to a goroutine to create the metadata,
  4. clear the record string and continue from step 2.

I also added a sync.WaitGroup to make sure I waited at the end for every goroutine to finish. I thought this would actually lower the time spent parsing the database file, as parsing would continue while the goroutines acted on each record. However, the code runs for more than 20 minutes, indicating that something is wrong or the overhead went crazy. Any suggestions?

package main

import (
    "bufio"
    "crypto/sha1"
    "fmt"
    "io"
    "log"
    "os"
    "strings"
    "sync"
    "time"
)

type producer struct {
    parser uniprot
}

type unit struct {
    tag string
}

type uniprot struct {
    filenames     []string
    recordUnits   chan unit
    recordStrings map[string]string
}

func main() {
    p := producer{parser: uniprot{}}
    p.parser.recordUnits = make(chan unit, 1000000)
    p.parser.recordStrings = make(map[string]string)
    p.parser.collectRecords(os.Args[1])
}

func (u *uniprot) collectRecords(name string) {
    fmt.Println("file to open ", name)
    t0 := time.Now()
    wg := new(sync.WaitGroup)
    record := []string{}
    file, err := os.Open(name)
    errorCheck(err)
    scanner := bufio.NewScanner(file)
    for scanner.Scan() { //Scan the file
        retText := scanner.Text()
        if strings.HasPrefix(retText, "//") {
            wg.Add(1)
            go u.handleRecord(record, wg)
            record = []string{}
        } else {
            record = append(record, retText)
        }
    }
    file.Close()
    wg.Wait()
    t1 := time.Now()
    fmt.Println(t1.Sub(t0))
}

func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
    defer wg.Done()
    recString := strings.Join(record, "\n")
    t := hashfunc(recString)
    u.recordUnits <- unit{tag: t}
    u.recordStrings[t] = recString
}

func hashfunc(record string) (hashtag string) {
    hash := sha1.New()
    io.WriteString(hash, record)
    hashtag = string(hash.Sum(nil))
    return
}

func errorCheck(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

1 answer

  • duanheye7909 2013-10-09 23:49

    First of all: your code is not thread-safe, mainly because you're accessing a hash map concurrently. Maps are not safe for concurrent use in Go and need to be locked. The faulty line in your code:

    u.recordStrings[t] = recString
    

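    A minimal sketch of one way to make that write safe, assuming a `sync.Mutex` field is added to the `uniprot` struct (the field name `mu` and the `store` helper are illustrative, not from the original code):

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    // Hypothetical reworking of the uniprot struct: a mutex guards the map.
    type uniprot struct {
    	mu            sync.Mutex
    	recordStrings map[string]string
    }

    // store serializes all map writes across goroutines.
    func (u *uniprot) store(tag, rec string) {
    	u.mu.Lock()
    	defer u.mu.Unlock()
    	u.recordStrings[tag] = rec
    }

    func main() {
    	u := &uniprot{recordStrings: make(map[string]string)}
    	var wg sync.WaitGroup
    	for i := 0; i < 100; i++ {
    		wg.Add(1)
    		go func(i int) {
    			defer wg.Done()
    			u.store(fmt.Sprintf("tag-%d", i), "record")
    		}(i)
    	}
    	wg.Wait()
    	fmt.Println(len(u.recordStrings)) // all 100 writes survive
    }
    ```

    Without the lock, the same program would crash or corrupt the map under concurrent writes.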
    Since this would blow up when running Go with GOMAXPROCS > 1, I'm assuming you're not doing that. Make sure you run your application with GOMAXPROCS=2 or higher to achieve parallelism. The default value is 1, so your code runs on a single OS thread, which of course cannot be scheduled on two CPUs or CPU cores simultaneously. Example:

    $ GOMAXPROCS=2 go run udb.go uniprot_sprot_viruses.dat
    

    Finally: pull the values from the channel, or your program will not terminate. You're creating a deadlock as soon as the number of records exceeds the channel's buffer. I tested with a 76 MiB file of data; you said your file was about 2.5 GB. My file yields 16,347 entries; assuming linear growth, yours will exceed 1e6, so there are not enough slots in the channel and your program will deadlock, giving no result while accumulating goroutines that never run, only to fail at the end (miserably).

    So the solution is to add a goroutine that pulls the values from the channel and does something with them.
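    A rough sketch of that pattern, with one consumer goroutine draining the channel while many producer goroutines send (the `done` channel and the counting are illustrative stand-ins for real metadata work):

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    type unit struct{ tag string }

    func main() {
    	recordUnits := make(chan unit, 16) // a small buffer suffices once a consumer drains it
    	var wg sync.WaitGroup

    	// Consumer: pull units off the channel so producers never block for long.
    	done := make(chan int)
    	go func() {
    		count := 0
    		for range recordUnits {
    			count++ // do something useful with each unit here
    		}
    		done <- count
    	}()

    	// Producers: stand-ins for the handleRecord goroutines.
    	for i := 0; i < 1000; i++ {
    		wg.Add(1)
    		go func(i int) {
    			defer wg.Done()
    			recordUnits <- unit{tag: fmt.Sprintf("rec-%d", i)}
    		}(i)
    	}

    	wg.Wait()
    	close(recordUnits) // safe: all senders are done
    	fmt.Println(<-done)
    }
    ```

    Note that the channel is closed only after every sender has finished, which lets the consumer's range loop terminate cleanly.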

    As a side note: if you're worried about performance, do not use strings, as they are always copied. Use []byte instead.
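    As a sketch, `hashfunc` could take `[]byte` directly. Hex-encoding the digest (an addition, not in the original) also gives a printable map key, since `string(hash.Sum(nil))` yields raw bytes:

    ```go
    package main

    import (
    	"crypto/sha1"
    	"encoding/hex"
    	"fmt"
    )

    // hashfunc over []byte: no string copy, and the digest is hex-encoded
    // so the resulting tag is printable text rather than raw bytes.
    func hashfunc(record []byte) string {
    	sum := sha1.Sum(record) // sha1.Sum returns a [20]byte digest
    	return hex.EncodeToString(sum[:])
    }

    func main() {
    	fmt.Println(hashfunc([]byte("ID   TEST_RECORD"))) // 40 hex characters
    }
    ```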
