2015-03-20 16:29

# Poor performance when using goroutines in a parallel quicksort implementation

Note: The "Go-lang parallel segment runs slower than series segment" question dealt with race conditions; this one has a different issue, so imho it's not a duplicate.

I'm trying to find an explanation for the following situation: Running parallel quicksort results in a significantly longer runtime when done using go routines.

Benchmarks are after the code:

```go
package c9sort

import (
	"time"
)

var runInParallel bool

func Quicksort(nums []int, parallel bool) ([]int, int) {
	started := time.Now()
	ch := make(chan int)
	runInParallel = parallel

	go quicksort(nums, ch)

	sorted := make([]int, len(nums))
	i := 0
	for next := range ch {
		sorted[i] = next
		i++
	}
	return sorted, int(time.Since(started).Nanoseconds() / 1000000)
}

func quicksort(nums []int, ch chan int) {
	// Choose the first number as the pivot
	pivot := nums[0]

	// Prepare secondary slices
	smallerThanPivot := make([]int, 0)
	largerThanPivot := make([]int, 0)

	// Slice off the pivot
	nums = nums[1:]

	// Go over the slice and partition around the pivot
	for _, i := range nums {
		switch {
		case i <= pivot:
			smallerThanPivot = append(smallerThanPivot, i)
		case i > pivot:
			largerThanPivot = append(largerThanPivot, i)
		}
	}

	var ch1 chan int
	var ch2 chan int

	// Now do the same for the two slices
	if len(smallerThanPivot) > 1 {
		ch1 = make(chan int, len(smallerThanPivot))
		if runInParallel {
			go quicksort(smallerThanPivot, ch1)
		} else {
			quicksort(smallerThanPivot, ch1)
		}
	}
	if len(largerThanPivot) > 1 {
		ch2 = make(chan int, len(largerThanPivot))
		if runInParallel {
			go quicksort(largerThanPivot, ch2)
		} else {
			quicksort(largerThanPivot, ch2)
		}
	}

	// Wait until the sorting finishes for the smaller slice
	if len(smallerThanPivot) > 1 {
		for i := range ch1 {
			ch <- i
		}
	} else if len(smallerThanPivot) == 1 {
		ch <- smallerThanPivot[0]
	}
	ch <- pivot

	if len(largerThanPivot) > 1 {
		for i := range ch2 {
			ch <- i
		}
	} else if len(largerThanPivot) == 1 {
		ch <- largerThanPivot[0]
	}

	close(ch)
}
```

Benchmarks for a random perm of 500000 integers:

Ran 100 times

Non parallel average - 1866ms

Parallel average - 2437ms

Any explanation would be appreciated. I know goroutines may not be best for this kind of parallelism, but I'm trying to understand the reason.


#### 2 Answers

• dougang1967 2015-03-21 05:05 (accepted)

Turns out it was very simple. As I'm on a new machine, the GOMAXPROCS variable wasn't set.

As predicted, the new benchmarks favor the parallel implementation. With GOMAXPROCS set to double the number of cores:

```
Using 16 goroutines
Ran 100 times

Non parallel average - 1980ms
Parallel average - 1133ms
```

Set to the number of cores:

```
Using 8 goroutines
Ran 100 times

Non parallel average - 2004ms
Parallel average - 1197ms
```

By the way, this is fairly consistent. The average for double the number of cores is always a bit better.

Benchmark for a larger collection (1000000):

```
Using 8 goroutines
Ran 100 times

Non parallel average - 3748ms
Parallel average - 2265ms
```

With double:

```
Using 16 goroutines
Ran 100 times

Non parallel average - 3817ms
Parallel average - 2012ms
```
• dqspy04266 2015-03-20 18:56

The general answer is that inter-thread coordination has a cost, so sending a task off to another goroutine can only speed things up if the task is at least a certain size. So don't send single items.

For a divide-and-conquer algorithm like quicksort, the ways to parallelize can be interesting. Generally: when you recurse, you can start the "sort one half of the array" task in a goroutine if it's large enough. The "if it's large enough" part is how you reduce coordination overhead.

In more detail, that looks like:

1. write a non-parallel `qsort(data []int)`

2. change `qsort` to optionally take a `*sync.WaitGroup` we'll call `wg` for signaling that it's done. Add `if wg != nil { wg.Done() }` before each of its `return`s.

3. where `qsort` recurses, have it check whether the half of the data it's going to sort is large-ish (say, over 500 items?), and

• if it's large, start another task: `wg.Add(1); go qsort(half, wg)`
• if it's not, sort here and now: `qsort(half, nil)`
4. wrap `qsort` to hide the parallel machinery from the user: have `quicksort(data)` do `wg := new(sync.WaitGroup)`, do an initial `wg.Add(1)`, call `qsort(data, wg)`, and then `wg.Wait()` to wait for all of the background sorts to complete.

This isn't an optimal strategy, because it keeps forking off new goroutines even after there are enough tasks to keep your cores busy. And there is a lot of tuning one could do around just how you hand off some tasks to the background. But the big point is, using another thread only for large subtasks is a way to parallelize quicksort without getting smashed by coordination overhead.

Here's a demo with an embarrassingly slapdash quicksort (you have to run it locally to get the timing). There's a good chance of bugs, and it's definitely quadratic on already-sorted data; the point is just to get the idea of parallel divide-and-conquer across.

There is also a bottom-up strategy--sort pieces then merge--used in, for instance, this parallel sort package. The in-place merge code that package uses is tricky, though.

(And if you haven't set GOMAXPROCS, set it with something like `runtime.GOMAXPROCS(runtime.NumCPU())` in `main`.)

Finally, look at Go's sort package. There's a lot of code, but it's pretty clear, and once you get it you can do your experiments with a "real", fleshed-out sort implementation.
