douhu1990 2017-02-21 20:30
Accepted

How can I detect what is preventing the use of multiple cores in Go?

So, I have a piece of concurrent code that is meant to run on every CPU/core.

There are two large vectors holding the input and output values:

var (
    input = make([]float64, rowCount)
    output = make([]float64, rowCount)
)

Both are filled, and I want to compute the distance (error) between each input-output pair. Since the pairs are independent, a possible concurrent version is the following:

var d float64 // Error to be computed
// Setup a worker "for each CPU"
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
    go func(id int) {
         var wd float64
         // eg nw = 4
         // worker0, i = 0, 4, 8, 12...
         // worker1, i = 1, 5, 9, 13...
         // worker2, i = 2, 6, 10, 14...
         // worker3, i = 3, 7, 11, 15...
         for i := id; i < rowCount; i += nw {
             res := compute(input[i])
             wd += distance(res, output[i])
         }
         ch <- wd
    }(w)
}
// Compute total distance
for w := 0; w < nw; w++ {
    d += <-ch
}

The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
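To check whether the pattern itself scales, it can help to run it in isolation and time it against the serial loop. The following is a minimal sketch of that experiment, not the original program: `compute` here is a hypothetical stand-in (the real one is not shown), and the row count and input values are arbitrary.

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"time"
)

// compute is a hypothetical stand-in for the real function, which is not
// shown in the question; it only burns CPU with math.Pow and math.Sqrt.
func compute(x float64) float64 {
	return math.Sqrt(math.Pow(x, 1.5) + 1)
}

// distance is the squared difference, as described in the question.
func distance(a, b float64) float64 {
	return (a - b) * (a - b)
}

// serialError sums the distance over all rows in a single goroutine.
func serialError(input, output []float64) float64 {
	var d float64
	for i := range input {
		d += distance(compute(input[i]), output[i])
	}
	return d
}

// parallelError splits the rows across nw workers with the same strided
// scheme as the question: worker id handles rows id, id+nw, id+2*nw, ...
func parallelError(input, output []float64, nw int) float64 {
	ch := make(chan float64)
	for w := 0; w < nw; w++ {
		go func(id int) {
			var wd float64
			for i := id; i < len(input); i += nw {
				wd += distance(compute(input[i]), output[i])
			}
			ch <- wd
		}(w)
	}
	var d float64
	for w := 0; w < nw; w++ {
		d += <-ch
	}
	return d
}

func main() {
	const rowCount = 2000000 // arbitrary size for the experiment
	input := make([]float64, rowCount)
	output := make([]float64, rowCount)
	for i := range input {
		input[i] = float64(i%1000) / 10
		output[i] = float64(i % 7)
	}

	start := time.Now()
	s := serialError(input, output)
	serialTime := time.Since(start)

	start = time.Now()
	p := parallelError(input, output, runtime.NumCPU())
	parallelTime := time.Since(start)

	fmt.Printf("serial:   %v (d=%g)\n", serialTime, s)
	fmt.Printf("parallel: %v (d=%g)\n", parallelTime, p)
}
```

If the isolated version scales but the full program does not, the bottleneck is more likely elsewhere in the program than in the worker pattern. The two totals can differ in the last digits because floating-point addition is performed in a different order.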

The problem I'm having is that this code is no faster than the serial code.

Now, I'm using Go 1.7, so GOMAXPROCS should already default to runtime.NumCPU(), but even setting it explicitly with runtime.GOMAXPROCS does not improve performance.
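One way to confirm that assumption rather than rely on it is to print the values the runtime actually sees. A small sketch (the helper name `schedulerInfo` is made up for illustration):

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo reports the values the scheduler is working with.
// Calling runtime.GOMAXPROCS with an argument below 1 only queries the
// current setting and does not change it.
func schedulerInfo() (numCPU, maxProcs, goroutines int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0), runtime.NumGoroutine()
}

func main() {
	numCPU, maxProcs, goroutines := schedulerInfo()
	fmt.Println("NumCPU:      ", numCPU)
	fmt.Println("GOMAXPROCS:  ", maxProcs)
	fmt.Println("NumGoroutine:", goroutines)
}
```

A GOMAXPROCS value lower than NumCPU would usually point to the GOMAXPROCS environment variable or to an earlier runtime.GOMAXPROCS call somewhere in the program.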

  • distance is just (a-b)*(a-b);
  • compute is a bit more complex, but should be reentrant and use global data only for reading (and uses math.Pow and math.Sqrt functions);
  • no other goroutine is running.

So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand, for example).

I also compiled with -race and nothing emerged.

My host has 4 virtual cores, but when I run this code htop shows CPU usage at about 102%, whereas I expected something around 380%, as happened in the past with other Go code that used all the cores.

I would like to investigate, but I don't know how the runtime allocates threads and schedules goroutines.

How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?

Thanks in advance

1 answer

  • duanbi5906 2017-02-22 15:30

    Sorry, but in the end I had gotten the measurement wrong. @JimB was right, and I had a minor leak, but not enough to explain a slowdown of this magnitude.

    My expectations were too high: the function I was making concurrent is called only at the beginning of the program, so the overall performance improvement was minor.

    After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section was the most important.
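The effect described here is what Amdahl's law predicts: when only a small fraction of the run time is parallelized, the overall speedup stays close to 1 no matter how many cores are used. A short sketch of the formula follows; the fractions are illustrative and are not measurements from this program.

```go
package main

import "fmt"

// amdahlSpeedup returns the overall speedup predicted by Amdahl's law when
// a fraction p of the serial run time (0 <= p <= 1) is sped up by a factor n.
func amdahlSpeedup(p, n float64) float64 {
	return 1 / ((1 - p) + p/n)
}

func main() {
	// Illustrative fractions only, not taken from the program in the question.
	for _, p := range []float64{0.05, 0.5, 0.95} {
		fmt.Printf("parallel fraction %.2f on 4 cores: %.2fx overall\n",
			p, amdahlSpeedup(p, 4))
	}
}
```

This is why it pays to profile first and apply the worker pattern to the sections that account for most of the run time.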

    Anyway, I learned a lot of interesting things meanwhile, so thanks a lot to all the people trying to help!
