I have written a multi-goroutine version of merge sort in Go, along with a benchmark test. Now I want to use "go tool pprof" to analyze the bottleneck in my code.
After collecting the CPU profile, I ran "top10" in pprof and got the following output:
Showing nodes accounting for 4.73s, 98.54% of 4.80s total
Dropped 21 nodes (cum <= 0.02s)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
     3.66s 76.25% 76.25%      3.66s 76.25%  pingcap/talentplan/tidb/common/alg/sort.Merge
     0.62s 12.92% 89.17%      0.64s 13.33%  pingcap/talentplan/tidb/mergesort.prepare
     0.17s  3.54% 92.71%      0.17s  3.54%  runtime.freedefer
     0.12s  2.50% 95.21%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.quickSort
     0.10s  2.08% 97.29%      0.10s  2.08%  runtime.memmove
     0.03s  0.62% 97.92%      0.03s  0.62%  runtime.memclrNoHeapPointers
     0.03s  0.62% 98.54%      0.04s  0.83%  runtime.stackpoolalloc
         0     0% 98.54%      0.11s  2.29%  pingcap/talentplan/tidb/common/alg/sort.MergeSortByMultiGoroutine
         0     0% 98.54%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.QuickSort
         0     0% 98.54%      4.04s 84.17%  pingcap/talentplan/tidb/common/alg/sort.mergeSortByMultiGoroutine
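(For reference, the benchmark driving this profile is shaped roughly like the sketch below. It is simplified, and the exact signatures of prepare and MergeSortByMultiGoroutine are illustrative here, not copied from my code; the real benchmark is in the repository linked at the end. prepare fills the slice with random data inside the timed region, which is why it shows up in the profile above.)

package mergesort

import (
	"math/rand"
	"testing"

	"pingcap/talentplan/tidb/common/alg/sort"
)

// prepare refills arr with fresh random data before every sort, so it
// runs inside the timed region and appears in the profile.
func prepare(arr []int64) {
	for i := range arr {
		arr[i] = rand.Int63()
	}
}

func BenchmarkMergeSortByMultiGoroutine(b *testing.B) {
	const n = 1 << 24 // a large input, so Merge dominates the samples
	arr := make([]int64, n)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		prepare(arr)
		sort.MergeSortByMultiGoroutine(arr)
	}
}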
Looking at that output, I think the bottleneck is in sort.Merge, so I ran "list Merge" to dive into the method and found the following:
         .          .     50:func Merge(arr []int64, start int, mid int, end int, tmp []int64) {
         .          .     51:    index, i, j := start, start, mid + 1
      80ms       80ms     52:    for ; i <= mid && j <= end; index++ {
     180ms      180ms     53:        if arr[i] <= arr[j] {
     1.58s      1.58s     54:            tmp[index] = arr[i]
      50ms       50ms     55:            i++
         .          .     56:        } else {
     1.52s      1.52s     57:            tmp[index] = arr[j]
      20ms       20ms     58:            j++
         .          .     59:        }
         .          .     60:    }
         .          .     61:    for ; i <= mid; index++ {
         .          .     62:        tmp[index] = arr[i]
         .          .     63:        i++
         .          .     64:    }
         .          .     65:    for ; j <= end; index++ {
         .          .     66:        tmp[index] = arr[j]
         .          .     67:        j++
         .          .     68:    }
         .          .     69:
      60ms       60ms     70:    for i := start; i <= end; i++ {
     170ms      170ms     71:        arr[i] = tmp[i]
         .          .     72:    }
         .          .     73:}
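(For context, the concurrent driver follows the usual fork-join pattern: sort the two halves in parallel goroutines, wait, then Merge. The sketch below is a simplified illustration of that structure, not my exact code; the depth cutoff and the quickSort signature are assumptions, and the real version is in the repository linked below.)

package sort

import "sync"

// Simplified sketch of the fork-join driver. quickSort and Merge are
// the functions from this package that appear in the profile above.
func mergeSortByMultiGoroutine(arr []int64, start, end int, tmp []int64, depth int) {
	if start >= end {
		return
	}
	if depth <= 0 {
		// Deep enough: stop spawning goroutines and sort sequentially.
		quickSort(arr, start, end)
		return
	}
	mid := (start + end) / 2
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		mergeSortByMultiGoroutine(arr, start, mid, tmp, depth-1)
	}()
	go func() {
		defer wg.Done()
		mergeSortByMultiGoroutine(arr, mid+1, end, tmp, depth-1)
	}()
	wg.Wait()
	// Both halves are now sorted; Merge combines them. This call is
	// where the profile attributes most of the time.
	Merge(arr, start, mid, end, tmp)
}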
Here is what confuses me. The Merge method contains four for-loops. The 1st and the 4th for-loops operate on approximately the same scale, and both of them just copy elements from one slice to another. So why does the 1st for-loop cost so much (1.58s plus 1.52s) while the 4th for-loop costs so little (just 170ms)? It seems counter-intuitive!
The project's GitHub address is https://github.com/Duncan15/talent-plan/tree/master/tidb/mergesort. You can run "make pprof" to execute the benchmark test and generate the CPU and memory profiles.
I want to know why this happens. If you have time, please read my code and give me some advice.
Thank you!
I have written some code to verify that when the Merge method runs in a single-goroutine environment, the 1st for-loop's cost is approximately the same as the 4th's, which seems intuitive. So I wondered whether the multi-goroutine environment causes the phenomenon above. In the multi-goroutine version, Merge runs concurrently; in other words, the 1st for-loop and the 4th for-loop run concurrently. If reading and writing slices concurrently increased the cost, the 4th for-loop's cost would have to increase as well, yet the pprof output shows that only the 1st for-loop's cost increases!
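(The single-goroutine check is essentially the sketch below: build two sorted halves that interleave, call Merge in a hot loop under the CPU profiler, and compare the per-line costs. The file name, input size, and iteration count are illustrative.)

package main

import (
	"os"
	"runtime/pprof"

	"pingcap/talentplan/tidb/common/alg/sort"
)

// fill puts even values in the first half and odd values in the second,
// so the merge loop alternates between both branches of the comparison.
func fill(arr []int64, mid int) {
	for i := 0; i <= mid; i++ {
		arr[i] = int64(2 * i)
	}
	for j := mid + 1; j < len(arr); j++ {
		arr[j] = int64(2*(j-mid) - 1)
	}
}

func main() {
	f, err := os.Create("merge_single.prof")
	if err != nil {
		panic(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	const n = 1 << 22
	arr := make([]int64, n)
	tmp := make([]int64, n)
	mid := n/2 - 1
	for iter := 0; iter < 100; iter++ {
		fill(arr, mid)
		sort.Merge(arr, 0, mid, n-1, tmp)
	}
}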
I also wrote another test to verify this idea; you can run it with "make vet". This test runs the Merge method concurrently. The difference from the multi-goroutine merge sort is that the test contains no sorting code at all, only merging. And to my surprise, in this test the 1st for-loop's cost is approximately the same as the 4th's. So now I am completely confused!
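(That merge-only test is shaped roughly like the sketch below: each goroutine merges its own disjoint slices, so the concurrent Merge calls share nothing at the Go level. The worker count, input size, and iteration count are illustrative.)

package main

import (
	"sync"

	"pingcap/talentplan/tidb/common/alg/sort"
)

func main() {
	const workers = 8
	const n = 1 << 22
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine owns its own arr and tmp, so the Merge
			// calls never touch shared memory.
			arr := make([]int64, n)
			tmp := make([]int64, n)
			mid := n/2 - 1
			for iter := 0; iter < 50; iter++ {
				// Even values in the first half, odd in the second,
				// so the merge loop exercises both branches.
				for i := 0; i <= mid; i++ {
					arr[i] = int64(2 * i)
				}
				for j := mid + 1; j < n; j++ {
					arr[j] = int64(2*(j-mid) - 1)
				}
				sort.Merge(arr, 0, mid, n-1, tmp)
			}
		}()
	}
	wg.Wait()
}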