I've written a simple crawler that looks something like this:
package main

import (
    "bytes"
    "io/ioutil"
    "net/http"
)

type SiteData struct {
    // ...
}

func downloadURL(url string) (body []byte, status int) {
    resp, err := http.Get(url)
    if err != nil {
        return
    }
    defer resp.Body.Close()
    status = resp.StatusCode
    body, err = ioutil.ReadAll(resp.Body)
    if err != nil {
        return nil, status
    }
    body = bytes.Trim(body, "\x00")
    return
}

func processSiteData(resp []byte) SiteData {
    // ...
}

func worker(input chan string, output chan SiteData) {
    // wait on the channel for links to process
    for url := range input {
        // fetch the http response and status code
        resp, status := downloadURL(url)
        if resp != nil && status == 200 {
            // if the fetch succeeded, process the
            // data and send it back
            output <- processSiteData(resp)
        } else {
            // otherwise send the url for processing
            // once more
            input <- url
        }
    }
}

func crawl(urlList []string) {
    numWorkers := 4
    input := make(chan string)
    output := make(chan SiteData)

    // spawn workers
    for i := 0; i < numWorkers; i++ {
        go worker(input, output)
    }

    // enqueue urls
    go func() {
        for _, url := range urlList {
            input <- url
        }
    }()

    // wait for the results
    for {
        select {
        case data := <-output:
            saveToDB(data)
        }
    }
}

func main() {
    urlList := loadLinksFromDB()
    crawl(urlList)
}
It scrapes a single website and works great: downloading data, processing it, and saving it to a database. Yet after a few minutes (5-10 or so) it gets "stuck" and needs to be restarted. The site isn't blacklisting me; I've verified that with them, and I can access any url at any time after the program blocks. Also, it blocks before all the urls are done processing. Obviously it will block once the list is spent, but it is nowhere near that point.
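To pin down where it blocks, a goroutine dump shows which channel operation every goroutine is parked on. A minimal way to expose one is the standard net/http/pprof package; this is a sketch, and the localhost:6060 address is an arbitrary choice:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side effect: registers the /debug/pprof/* handlers
)

func init() {
    // Serve the pprof endpoints in the background; while the program is
    // stuck, http://localhost:6060/debug/pprof/goroutine?debug=2 prints
    // the stack of every goroutine.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}

Alternatively, sending the process SIGQUIT (Ctrl-\ in a terminal) makes the Go runtime print every goroutine's stack before exiting.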
Am I doing something wrong here? The reason I'm using for { select { ... } } instead of for _, _ = range urlList { // read output } is that any url can be re-enqueued if it fails to process, so the number of results isn't known up front (a sketch of why the counting loop falls short is below). The database doesn't seem to be the issue here either.
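To spell it out: the counting loop receives exactly len(urlList) results, so a single url that fails on every attempt (and keeps getting re-enqueued) would leave the loop blocked on a receive that never arrives:

// The counting variant I decided against: it assumes every url
// eventually yields exactly one SiteData, which isn't true if
// some url fails on every try.
for range urlList {
    data := <-output
    saveToDB(data)
}

Any input will help - thanks.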