I'd like to distribute some load across goroutines. If the number of tasks is known beforehand then it is easy to organize. For example, I could fan out with a wait group:
nTasks := 100
nGoroutines := 10
// it is important that this channel is not buffered
ch := make(chan *Task)
done := make(chan bool)
var w sync.WaitGroup
// Feed the channel until done
go func() {
    for i := 0; i < nTasks; i++ {
        task := getTaskI(i)
        ch <- task
    }
    // as ch is unbuffered, once every send has completed we know
    // every task has been handed to a worker
    for i := 0; i < nGoroutines; i++ {
        done <- false
    }
}()
for i := 0; i < nGoroutines; i++ {
    w.Add(1)
    go func() {
        defer w.Done()
        for {
            select {
            case task := <-ch:
                doSomethingWithTask(task)
            case <-done:
                return
            }
        }
    }()
}
w.Wait()
// All tasks done, all goroutines closed
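For reference, here is a complete, runnable version of that pattern. Task, getTaskI, and doSomethingWithTask are stand-ins I made up for this sketch, since their real definitions aren't shown; the workers count processed tasks so the sketch can check itself:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Task, getTaskI and doSomethingWithTask are assumed stand-ins.
type Task struct{ ID int }

func getTaskI(i int) *Task        { return &Task{ID: i} }
func doSomethingWithTask(t *Task) {}

// fanOut runs nTasks tasks on nWorkers goroutines and
// returns how many tasks were processed.
func fanOut(nTasks, nWorkers int) int64 {
	ch := make(chan *Task) // unbuffered on purpose
	done := make(chan bool)
	var w sync.WaitGroup
	var processed int64

	// Feed the channel, then tell every worker to stop.
	go func() {
		for i := 0; i < nTasks; i++ {
			ch <- getTaskI(i)
		}
		// ch is unbuffered, so at this point every task
		// has been received by some worker.
		for i := 0; i < nWorkers; i++ {
			done <- false
		}
	}()

	for i := 0; i < nWorkers; i++ {
		w.Add(1)
		go func() {
			defer w.Done()
			for {
				select {
				case task := <-ch:
					doSomethingWithTask(task)
					atomic.AddInt64(&processed, 1)
				case <-done:
					return
				}
			}
		}()
	}
	w.Wait()
	return processed
}

func main() {
	fmt.Println(fanOut(100, 10)) // prints 100
}
```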
However, in my case each task can return more tasks to be done. Take a crawler, for example: every page we crawl yields new links that also need crawling. My initial hunch was to have a main loop that tracks the number of pending tasks; once it reaches zero, it sends a finish signal to all goroutines:
nGoroutines := 10
ch := make(chan *Task, nGoroutines)
feedBackChannel := make(chan *Task, nGoroutines)
done := make(chan bool)
for i := 0; i < nGoroutines; i++ {
    go func() {
        for {
            select {
            case task := <-ch:
                task.NextTasks = doSomethingWithTask(task)
                feedBackChannel <- task
            case <-done:
                return
            }
        }
    }()
}
// seed the first task
ch <- firstTask
nTasksRemaining := 1
for nTasksRemaining > 0 {
    task := <-feedBackChannel
    nTasksRemaining--
    for _, t := range task.NextTasks {
        ch <- t
        nTasksRemaining++
    }
}
for i := 0; i < nGoroutines; i++ {
    done <- false
}
However, this deadlocks. For example, if a task returns more NextTasks than the channel buffer can hold, the main loop blocks on ch <- t. But those earlier tasks can never finish: the workers are blocked on feedBackChannel <- task, and the main loop is no longer reading from feedBackChannel because it is stuck writing.
One "easy" way out of this is to post to the feedback channel asynchronously: instead of feedBackChannel <- task, do go func() { feedBackChannel <- task }(). But this feels like an awful hack, especially since there might be hundreds of thousands of tasks, each holding a goroutine alive just to deliver its result.
What would be a nice way to avoid this deadlock? I've searched for concurrency patterns, but most of what I've found covers simpler cases like fan-out or pipelines, where later stages don't feed back into earlier ones.
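One direction I've been considering, which seems less hacky than spawning a goroutine per result: keep the undistributed tasks in a plain slice inside the main loop, and use a select whose send case is only enabled (via the nil-channel trick) when the slice is non-empty. That way the loop can always keep draining feedBackChannel, so workers never block. A sketch, with Task as a made-up stand-in and workers that just echo tasks back instead of doing real work:

```go
package main

import "fmt"

// Task is an assumed stand-in for the question's *Task type.
type Task struct {
	NextTasks []*Task
}

// runAll dispatches tasks to nWorkers goroutines; each finished task
// may yield new tasks via NextTasks. It returns the number of tasks
// processed, and never blocks on a full channel because undistributed
// tasks wait in a local slice.
func runAll(first *Task, nWorkers int) int {
	ch := make(chan *Task)
	feedBack := make(chan *Task)
	done := make(chan struct{})

	for i := 0; i < nWorkers; i++ {
		go func() {
			for {
				select {
				case task := <-ch:
					// real work (doSomethingWithTask) would run here
					feedBack <- task
				case <-done:
					return
				}
			}
		}()
	}

	queue := []*Task{first} // tasks not yet handed to a worker
	pending := 0            // tasks handed out, result not yet received
	processed := 0
	for pending > 0 || len(queue) > 0 {
		// A send on a nil channel blocks forever, so leaving sendCh
		// nil disables the send case while the queue is empty.
		var sendCh chan *Task
		var next *Task
		if len(queue) > 0 {
			sendCh, next = ch, queue[0]
		}
		select {
		case sendCh <- next:
			queue = queue[1:]
			pending++
		case t := <-feedBack:
			pending--
			processed++
			queue = append(queue, t.NextTasks...)
		}
	}
	close(done) // closing reaches every worker, unlike N sends
	return processed
}

func main() {
	leafs := []*Task{{}, {}, {}}
	root := &Task{NextTasks: leafs}
	fmt.Println(runAll(root, 10)) // prints 4: the root plus 3 children
}
```

The key point is that the main loop is always willing to receive from feedBack, so a worker can never be stuck delivering a result; the slice absorbs any burst of NextTasks instead of the channel buffer. I'm not sure this is the canonical pattern, though.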