I'm trying to make a web scraper that can run a decent number (many thousands) of HTTP queries per minute. The actual querying is fine, but to speed up the process I'm trying to make it concurrent. Initially I spawned a goroutine for each request, but I ran out of file descriptors, so after some googling I decided to use a semaphore to limit the number of concurrent goroutines.
Only I can't get this to work.
I've tried moving bits of code around but I always have the same issue: I have roughly three times as many goroutines running as I want.
This is the only method I have that spawns goroutines. I limited the goroutines to 80. In my benchmarks I run this against a slice of 10000 URLs and it tends to hover at about 242 concurrent goroutines in flight, but then it suddenly goes up to almost double this and then back down to 242.
I get the same behaviour if I change the concurrent value from 80 to something else: the count usually hovers at just over three times the limit and sometimes spikes to around double that, and I have no idea why.
    func (B BrandScraper) ScrapeUrls(URLs ...string) []scrapeResponse {
        concurrent := 80
        semaphoreChan := make(chan struct{}, concurrent)
        scrapeResults := make([]scrapeResponse, len(URLs))
        for _, URL := range URLs {
            semaphoreChan <- struct{}{}
            go func(URL string) {
                defer func() {
                    <-semaphoreChan
                }()
                scrapeResults = append(scrapeResults, B.getIndividualScrape(URL))
                fmt.Printf("#goroutines: %d\n", runtime.NumGoroutine())
            }(URL)
        }
        return scrapeResults
    }
I'm expecting it to be constantly at 80 goroutines - or at least constant.
This happens both when I run it from a benchmarking test and when I run it from the main function.
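To rule out the semaphore itself, here is a stripped-down sketch of the pattern in isolation, with a stub in place of the real request. The names `scrapeAll`, `maxInFlight` and the `fetch` parameter are made up for this illustration and are not part of my scraper. It also differs from the method above in two ways that I suspect matter but that are separate from the goroutine count: it waits for all workers with a `sync.WaitGroup` before returning, and each worker writes to its own index instead of calling `append` on a shared slice from several goroutines.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// scrapeAll runs fetch for every URL with at most limit calls in flight.
// fetch is a hypothetical stand-in for B.getIndividualScrape.
func scrapeAll(urls []string, limit int, fetch func(string) string) []string {
	results := make([]string, len(urls)) // one slot per URL, filled by index
	sem := make(chan struct{}, limit)    // buffered channel used as a counting semaphore
	var wg sync.WaitGroup

	for i, u := range urls {
		sem <- struct{}{} // blocks while limit workers are already running
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this worker finishes
			results[i] = fetch(u)    // each goroutine writes a distinct index
		}(i, u)
	}
	wg.Wait() // make sure the last workers have finished before returning
	return results
}

// maxInFlight reports the highest number of simultaneous fetch calls seen
// while processing n dummy URLs with the given limit.
func maxInFlight(n, limit int) int64 {
	var cur, peak int64
	scrapeAll(make([]string, n), limit, func(string) string {
		c := atomic.AddInt64(&cur, 1)
		for { // raise peak to c if c is larger
			p := atomic.LoadInt64(&peak)
			if c <= p || atomic.CompareAndSwapInt64(&peak, p, c) {
				break
			}
		}
		time.Sleep(time.Millisecond) // simulate some work
		atomic.AddInt64(&cur, -1)
		return "ok"
	})
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrent fetches:", maxInFlight(200, 8))
}
```

With a stub like this the number of simultaneous `fetch` calls cannot exceed the limit, which makes me think the extra goroutines reported by `runtime.NumGoroutine()` come from somewhere other than my own `go` statements.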
Thanks very much for any tips!
EDIT
getIndividualScrape calls another function:
    func (B BrandScraper) doGetRequest(URL string) io.Reader {
        resp, err := http.Get(URL)
        if err != nil {
            log.Fatal(err)
        }
        body, _ := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        return bytes.NewReader(body)
    }
which obviously does an HTTP request. Could this be leaking goroutines? I thought that since I'd closed resp.Body I'd have covered that, but maybe not?