dongwuxie5112 2019-04-24 16:29
浏览 69

为什么在多goroutine中读写切片会花费很多呢?

I have written a multi-goroutine version's mergeSort by go, and I also wrote a benchmark test. Now I want to use "go tool pprof" to analyze the bottleneck of my code.

When I got the cpu profile, I use "top10" in pprof to get the following output:

Showing nodes accounting for 4.73s, 98.54% of 4.80s total
Dropped 21 nodes (cum <= 0.02s)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
     3.66s 76.25% 76.25%      3.66s 76.25%  pingcap/talentplan/tidb/common/alg/sort.Merge
     0.62s 12.92% 89.17%      0.64s 13.33%  pingcap/talentplan/tidb/mergesort.prepare
     0.17s  3.54% 92.71%      0.17s  3.54%  runtime.freedefer
     0.12s  2.50% 95.21%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.quickSort
     0.10s  2.08% 97.29%      0.10s  2.08%  runtime.memmove
     0.03s  0.62% 97.92%      0.03s  0.62%  runtime.memclrNoHeapPointers
     0.03s  0.62% 98.54%      0.04s  0.83%  runtime.stackpoolalloc
         0     0% 98.54%      0.11s  2.29%  pingcap/talentplan/tidb/common/alg/sort.MergeSortByMultiGoroutine
         0     0% 98.54%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.QuickSort
         0     0% 98.54%      4.04s 84.17%  pingcap/talentplan/tidb/common/alg/sort.mergeSortByMultiGoroutine

From the above, I think that the curious bottleneck is in sort.Merge, so I use "list Merge" to dive into the method, and I find the following info:

     .          .     50:func Merge(arr []int64, start int, mid int, end int, tmp []int64)  {
     .          .     51:   index, i, j := start, start, mid + 1
  80ms       80ms     52:   for ; i <= mid && j <= end; index++ {
 180ms      180ms     53:           if arr[i] <= arr[j] {
 1.58s      1.58s     54:                   tmp[index] = arr[i]
  50ms       50ms     55:                   i++
     .          .     56:           } else {
 1.52s      1.52s     57:                   tmp[index] = arr[j]
  20ms       20ms     58:                   j++
     .          .     59:           }
     .          .     60:   }
     .          .     61:   for ; i <= mid; index++ {
     .          .     62:           tmp[index] = arr[i]
     .          .     63:           i++
     .          .     64:   }
     .          .     65:   for ; j <= end; index++ {
     .          .     66:           tmp[index] = arr[j]
     .          .     67:           j++
     .          .     68:   }
     .          .     69:
  60ms       60ms     70:   for i := start; i <= end; i++ {
 170ms      170ms     71:           arr[i] = tmp[i]
     .          .     72:   }
     .          .     73:}

What confuse me is here! In the Merge method, there has 4 for-loop. The 1st for-loop and the 4th for-loop's scale is approximately the same, and both of them's task are moving elements from a slice to another slice. And the problem is that why the 1st for-loop cost so much(1.58s plus 1.52s), but the 4th for-loop cost so few(just 170ms)? It's counter-intuitive!

This project's github address is https://github.com/Duncan15/talent-plan/tree/master/tidb/mergesort. You can use "make pprof" to run the benchmark test and get the cpu profile and memory profile.

I want to know why this happen, if you have time, please read my code and give me some advise.

thank you for telling me!!!

I have write some code to verify that when the Merge method run in single-goroutine environment the 1st for-loop's cost is approximately the same with the 4th-for-loop, this seems intuitive. So I think whether the multi-goroutine environment cause the above phenomenon. But in multi-goroutine environment, the Merge method run concurrently. In other word, the 1st for-loop and 4th for-loop run concurrently, if read and write slice concurrently would increase the cost, the 4th for-loop's cost must also increase, but from the pprof output we can find that only the 1st for-loop's cost increase!

And I also write another test to verify my thought. You can use "make vet" to run it. this test run the Merge method concurrently. The difference with the multi-goroutine version's mergeSort is that this test has no code about sorting, just merely code about merging. And I surprisingly find that in this test the 1st for-loop's cost is approximately the same with the 4th for-loop's. So finally I am confused completely!

  • 写回答

0条回答 默认 最新

    报告相同问题?

    悬赏问题

    • ¥20 wireshark抓不到vlan
    • ¥20 关于#stm32#的问题:需要指导自动酸碱滴定仪的原理图程序代码及仿真
    • ¥20 设计一款异域新娘的视频相亲软件需要哪些技术支持
    • ¥15 stata安慰剂检验作图但是真实值不出现在图上
    • ¥15 c程序不知道为什么得不到结果
    • ¥40 复杂的限制性的商函数处理
    • ¥15 程序不包含适用于入口点的静态Main方法
    • ¥15 素材场景中光线烘焙后灯光失效
    • ¥15 请教一下各位,为什么我这个没有实现模拟点击
    • ¥15 执行 virtuoso 命令后,界面没有,cadence 启动不起来