doubushi0031
2014-04-06 08:46

Go crawler stops selecting from the output channel after a few minutes

  • channel
  • goroutine
Accepted

I've written a simple crawler that looks something like this:

type SiteData struct {
    // ...
}

func downloadURL(url string) (body []byte, status int) {
    resp, err := http.Get(url)

    if err != nil {
        return
    }

    status = resp.StatusCode
    defer resp.Body.Close()

    body, err = ioutil.ReadAll(resp.Body)
    if err != nil {
        // don't hand a partial body to the caller
        body = nil
        return
    }
    body = bytes.Trim(body, "\x00")

    return
}


func processSiteData(resp []byte) SiteData {
    // ...
}    

func worker(input chan string, output chan SiteData) {

    // wait on the channel for links to process
    for url := range input {

        // fetch the http response and status code
        resp, status := downloadURL(url)

        if resp != nil && status == 200 {
            // if no errors in fetching link
            // process the data and send 
            // it back
            output <- processSiteData(resp)
        } else {
            // otherwise send the url for processing
            // once more
            input <- url
        }
    }
}

func crawl(urlList []string) {
    numWorkers := 4
    input := make(chan string)
    output := make(chan SiteData)

    // spawn workers
    for i := 0; i < numWorkers; i++ {
        go worker(input, output)
    }

    // enqueue urls
    go func() {
        for url := range urlList {
            input <- url
        }
    }()

    // wait for the results
    for {
        select {
        case data := <-output:
            saveToDB(data)
        }
    }

}

func main() {
    urlList := loadLinksFromDB()
    crawl(urlList)
}

It scrapes a single website and works great - downloading data, processing it and saving it to a database. Yet after a few minutes (5-10) or so it gets "stuck" and needs to be restarted. The site isn't blacklisting me; I've verified with them, and I can access any URL at any time after the program blocks. Also, it blocks before all the URLs are done processing. Obviously it'll block when the list is spent, but it is nowhere near that.

Am I doing something wrong here? The reason I'm using for { select { ... } } instead of for _, _ = range urlList { // read output } is that any URL can be re-enqueued if it fails to process. In addition, the database doesn't seem to be the issue here either. Any input will help - thanks.


1 answer

  • doukuiqian9911 · 7 years ago

    I believe this hangs when you have all N workers waiting on input <- url, and hence there are no more workers taking stuff out of input. In other words, if 4 URLs fail roughly at the same time, it will hang.
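    The hang is easy to demonstrate in isolation: a send on an unbuffered channel never completes until some goroutine receives it. A minimal standalone sketch of that state (the URL string is illustrative, and the timeout is only there so the demo terminates):

    ```go
    package main

    import (
    	"fmt"
    	"time"
    )

    func main() {
    	// Unbuffered channel with nobody receiving: this mirrors the
    	// crawler once every worker is itself blocked on `input <- url`,
    	// leaving no goroutine free to receive.
    	input := make(chan string)

    	select {
    	case input <- "http://example.com/retry": // illustrative URL
    		fmt.Println("sent")
    	case <-time.After(100 * time.Millisecond):
    		fmt.Println("send blocked: no receiver on input")
    	}
    }
    ```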

    The solution is to send failed URLs to some place that is not the input channel for the workers (to avoid deadlock).

    One possibility is to have a separate failed channel, with the anonymous goroutine always accepting input from it. Like this (not tested):

    package main
    
    func worker(input chan string, output chan SiteData, failed chan string) {
        for url := range input {
            // ...
            if resp != nil && status == 200 {
                output <- processSiteData(resp)
            } else {
                failed <- url
            }
        }
    }
    
    func crawl(urlList []string) {
        numWorkers := 4
        input := make(chan string)
        failed := make(chan string)
        output := make(chan SiteData)
    
        // spawn workers
        for i := 0; i < numWorkers; i++ {
            go worker(input, output, failed)
        }
    
        // Dispatch URLs to the workers, also receive failures from them.
        go func() {
            for {
                // When urlList is empty, leave `in` nil: a send on a
                // nil channel blocks forever, so select never picks
                // that case (and urlList[0] can't panic on an empty
                // slice). The goroutine then just waits on `failed`.
                var in chan string
                var next string
                if len(urlList) > 0 {
                    in = input
                    next = urlList[0]
                }
                select {
                case in <- next:
                    urlList = urlList[1:]
                case url := <-failed:
                    urlList = append(urlList, url)
                }
            }
        }()
    
        // wait for the results
        for {
            data := <-output
            saveToDB(data)
        }
    }
    
    func main() {
        urlList := loadLinksFromDB()
        crawl(urlList)
    }
    

    (Note that you are right, as you say in your question, not to use for _, _ = range urlList { // read output } in your crawl() function, because URLs can be re-enqueued; but as far as I can tell you don't need select in the result loop either.)
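    A dispatcher like the one above also has to cope with the pending list going empty, since indexing `urlList[0]` on an empty slice panics. Go's select treats a nil channel as never ready, so the send case can be switched off by swapping in a nil channel variable. A standalone sketch of the idiom (names are illustrative):

    ```go
    package main

    import "fmt"

    func main() {
    	pending := []string{"a", "b"}
    	results := make(chan string, 2) // buffered so sends succeed here

    	for i := 0; i < 3; i++ {
    		// When pending is empty, leave `out` nil: a send on a nil
    		// channel blocks forever, so select never picks that case.
    		var out chan string
    		var next string
    		if len(pending) > 0 {
    			out = results
    			next = pending[0]
    		}
    		select {
    		case out <- next:
    			pending = pending[1:]
    		default:
    			fmt.Println("pending empty, send case disabled")
    		}
    	}
    	fmt.Println("left in queue:", len(pending))
    }
    ```

    In a crawler dispatcher the second case would typically be the `<-failed` receive rather than a `default`, so the goroutine simply waits for retries while the list is empty.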
