As part of my Go tutorial, I'm writing a simple program that counts words across multiple files. I have a few goroutines processing files and building a map[string]int
telling how many occurrences of each word were found. Each map is then sent to a reducing goroutine, which aggregates the values into a single map. Sounds pretty straightforward and looks like a perfect (map-reduce) task for Go!
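For context, each file-processing goroutine looks roughly like the sketch below. This is not my exact code: the name countWords, reading the whole file with os.ReadFile, and splitting on whitespace with strings.Fields are simplifications for illustration.

import (
    "log"
    "os"
    "strings"
)

// countWords is a simplified stand-in for my mapper: it reads one
// file, counts its words in a local map, and ships that map to the
// reducer's input channel.
func countWords(path string, out chan<- map[string]int) {
    data, err := os.ReadFile(path)
    if err != nil {
        log.Println(err)
        out <- nil // ranging over a nil map in the reducer is a no-op
        return
    }
    counts := make(map[string]int)
    for _, w := range strings.Fields(string(data)) {
        counts[w]++
    }
    out <- counts
}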
I have around 10k documents with 1.6 million unique words. What I found is that memory usage grows fast and steadily while the code runs, and I run out of memory about halfway through processing (on a 12GB box with 7GB free). So yes, it uses gigabytes for this small data set!
Trying to figure out where the problem lies, I found that the reducer collecting and aggregating the data is to blame. Here is the code:
func reduceWords(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input { // one map per processed file
        for w, c := range wordMap {
            total[w] += c
        }
    }
    output <- len(total) // report the number of unique words
}
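The wiring, again sketched rather than verbatim, is one mapper goroutine per file, with a sync.WaitGroup to close the reducer's input once all mappers are done:

import "sync"

// run starts the reducer, fans out one mapper goroutine per file,
// closes the reducer's input once every mapper has finished, and
// returns the number of unique words seen.
func run(paths []string) int {
    input := make(chan map[string]int)
    output := make(chan int)
    go reduceWords(input, output)

    var wg sync.WaitGroup
    for _, p := range paths {
        wg.Add(1)
        go func(p string) {
            defer wg.Done()
            countWords(p, input)
        }(p)
    }
    go func() {
        wg.Wait()
        close(input) // ends the reducer's range loop
    }()
    return <-output
}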
If I remove the map from the sample above, memory stays within reasonable limits (a few hundred megabytes). What I found, though, is that taking a copy of the string also solves the problem; i.e., the following version doesn't eat up my memory:
func reduceWords(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input {
        for w, c := range wordMap {
            copyW := make([]byte, len(w)) // <-- will put a copy here!
            copy(copyW, w)
            total[string(copyW)] += c
        }
    }
    output <- len(total)
}
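For what it's worth, on Go 1.18 and newer the same copy can be forced more directly with strings.Clone, which is documented to return a fresh copy of its argument in a new allocation:

total[strings.Clone(w)] += c // guaranteed fresh copy, detached from whatever backing array w shares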
Is it possible that a wordMap instance is not being destroyed after each iteration when I use the value directly? (As a C++ programmer, I have limited intuition when it comes to GC.) Is this expected behaviour? Am I doing something wrong? Should I be disappointed with Go, or rather with myself?
Thanks!