What causes the huge overhead of goroutines?

For an assignment we are using Go, and one of the things we are going to do is parse a UniProt database file line by line to collect UniProt records.

I prefer not to share too much code, but I have a working code snippet that parses such a file (2.5 GB) correctly in 48 s (measured using Go's time package). It iterates over the file and adds lines to a record until a record-end signal is reached (a full record), and then metadata on the record is created. The record string is then reset, and a new record is collected line by line. Then I thought that I would try to use goroutines.

I have received some tips on Stack Overflow before, so to the original code I simply added a function to handle everything concerning the metadata creation.

So, the code does the following:

1) create an empty record,
2) iterate over the file and add lines to the record,
3) if a record stop signal is found (now we have a full record), give it to a goroutine to create the metadata,
4) reset the record string and continue from 2).

I also added a sync.WaitGroup to make sure that I wait (at the end) for each goroutine to finish. I thought that this would actually lower the time spent parsing the database file, since parsing continues while the goroutines act on each record. However, the code seems to run for more than 20 minutes, indicating that something is wrong or the overhead went crazy. Any suggestions?

package main

import (
    "bufio"
    "crypto/sha1"
    "fmt"
    "io"
    "log"
    "os"
    "strings"
    "sync"
    "time"
)

type producer struct {
    parser uniprot
}

type unit struct {
    tag string
}

type uniprot struct {
    filenames     []string
    recordUnits   chan unit
    recordStrings map[string]string
}

func main() {
    p := producer{parser: uniprot{}}
    p.parser.recordUnits = make(chan unit, 1000000)
    p.parser.recordStrings = make(map[string]string)
    p.parser.collectRecords(os.Args[1])
}

func (u *uniprot) collectRecords(name string) {
    fmt.Println("file to open ", name)
    t0 := time.Now()
    wg := new(sync.WaitGroup)
    record := []string{}
    file, err := os.Open(name)
    errorCheck(err)
    scanner := bufio.NewScanner(file)
    for scanner.Scan() { //Scan the file
        retText := scanner.Text()
        if strings.HasPrefix(retText, "//") {
            wg.Add(1)
            go u.handleRecord(record, wg)
            record = []string{}
        } else {
            record = append(record, retText)
        }
    }
    file.Close()
    wg.Wait()
    t1 := time.Now()
    fmt.Println(t1.Sub(t0))
}

func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
    defer wg.Done()
    recString := strings.Join(record, "\n")
    t := hashfunc(recString)
    u.recordUnits <- unit{tag: t}
    u.recordStrings[t] = recString
}

func hashfunc(record string) (hashtag string) {
    hash := sha1.New()
    io.WriteString(hash, record)
    hashtag = string(hash.Sum(nil))
    return
}

func errorCheck(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

duanheye7909 2013-10-09 23:49
First of all: your code is not thread-safe, mainly because you're accessing a map concurrently. Maps are not safe for concurrent use in Go and need to be guarded by a lock. Faulty line in your code:

u.recordStrings[t] = recString

The unsynchronized map write will blow up as soon as you run Go with GOMAXPROCS > 1, so I'm assuming that you're not doing that. Make sure you're running your application with GOMAXPROCS=2 or higher to achieve parallelism. The default value is 1 (in Go releases before 1.5; later releases default to the number of available CPUs), therefore your code runs on a single OS thread which, of course, can't be scheduled on two CPUs or CPU cores simultaneously. Example:

$ GOMAXPROCS=2 go run udb.go uniprot_sprot_viruses.dat

Finally: pull the values from the channel, otherwise your program will not terminate. You're creating a deadlock if the number of records exceeds the channel's buffer size (1000000). I tested with a 76 MiB file of data containing 16347 entries; you said your file was about 2.5 GB. If your file holds more than 1e6 records, there are not enough slots in the channel and your program will deadlock, giving no result while accumulating goroutines that block on the send and never finish.

So the solution is to add a goroutine that pulls the values from the channel and does something with them.
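
A sketch of that idea, under the assumption that counting the units stands in for the real work: the drain function is a hypothetical helper, and the channel buffer is kept deliberately small to show that producers never get stuck while a consumer is running.

```go
package main

import (
	"fmt"
	"sync"
)

type unit struct {
	tag string
}

// drain (hypothetical helper) starts one consumer goroutine that receives
// units until the channel is closed, then reports how many it saw.
func drain(units <-chan unit) <-chan int {
	done := make(chan int)
	go func() {
		count := 0
		for range units {
			count++ // placeholder for the real per-unit work
		}
		done <- count
	}()
	return done
}

func main() {
	units := make(chan unit, 4) // small buffer on purpose
	done := drain(units)

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			units <- unit{tag: fmt.Sprintf("tag-%d", i)}
		}(i)
	}
	wg.Wait()    // every producer has sent its unit
	close(units) // lets the consumer's range loop end
	fmt.Println(<-done)
}
```

Because only the consumer goroutine touches the results, it could also own the map of record strings, which would make a lock unnecessary.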

As a side note: if you're worried about performance, avoid strings where you can, because calls such as scanner.Text() and strings.Join allocate and copy the data. Use []byte instead.
(This answer was accepted by the asker as the best answer.)