doubushi0031 2014-04-06 08:46
Accepted

Go crawler stops selecting from the output channel after a few minutes

I've written a simple crawler that looks something like this:

type SiteData struct {
    // ...
}

func downloadURL(url string) (body []byte, status int) {
    resp, err := http.Get(url)

    if err != nil {
        // on error, return the zero values (nil body, status 0);
        // the worker treats this as a failed fetch
        return
    }

    status = resp.StatusCode
    defer resp.Body.Close()

    body, err = ioutil.ReadAll(resp.Body)
    body = bytes.Trim(body, "\x00")

    return
}


func processSiteData(resp []byte) SiteData {
    // ...
}    

func worker(input chan string, output chan SiteData) {

    // wait on the channel for links to process
    for url := range input {

        // fetch the http response and status code
        resp, status := downloadURL(url)

        if resp != nil && status == 200 {
            // if no errors in fetching link
            // process the data and send 
            // it back
            output <- processSiteData(resp)
        } else {
            // otherwise send the url for processing
            // once more
            input <- url
        }
    }
}

func crawl(urlList []string) {
    numWorkers := 4
    input := make(chan string)
    output := make(chan SiteData)

    // spawn workers
    for i := 0; i < numWorkers; i++ {
        go worker(input, output)
    }

    // enqueue urls
    go func() {
        for _, url := range urlList {
            input <- url
        }
    }()

    // wait for the results
    for {
        select {
        case data := <-output:
            saveToDB(data)
        }
    }

}

func main() {
    urlList := loadLinksFromDB()
    crawl(urlList)
}

It scrapes a single website and works great: downloading data, processing it, and saving it to a database. Yet after a few minutes (5-10 or so) it gets "stuck" and needs to be restarted. The site isn't blacklisting me; I've verified that with them, and I can access any URL at any time after the program blocks. It also blocks before all the URLs are done processing. Obviously it would block once the list is spent, but it is nowhere near that point.

Am I doing something wrong here? The reason I'm using for { select { ... } } instead of for _, _ = range urlList { // read output } is that any URL can be re-enqueued if it fails to process. The database doesn't seem to be the issue either. Any input will help - thanks.


1 answer

  • doukuiqian9911 2014-04-06 17:35

    I believe this hangs when you have all N workers waiting on input <- url, and hence there are no more workers taking stuff out of input. In other words, if 4 URLs fail roughly at the same time, it will hang.
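    Here is a minimal sketch of that failure mode (hypothetical URLs, and every fetch treated as a failure): with an unbuffered input channel, once all four workers and the feeding goroutine are blocked on a send, no goroutine is left to receive. In this stripped-down form the Go runtime even detects it and aborts with "all goroutines are asleep - deadlock!"; the full crawler just hangs, as described in the question.

    package main

    import "fmt"

    func main() {
        input := make(chan string) // unbuffered, as in the question

        // four workers that all "fail" and re-enqueue onto the same channel
        for i := 0; i < 4; i++ {
            go func() {
                for url := range input {
                    // a failed fetch: push the url back onto input; this send
                    // blocks until some other goroutine receives from input
                    input <- url
                }
            }()
        }

        // feeder, like the anonymous goroutine in crawl()
        for i := 0; ; i++ {
            input <- fmt.Sprintf("http://example.com/page%d", i)
        }
    }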

    The solution is to send failed URLs to some place that is not the input channel for the workers (to avoid deadlock).

    One possibility is to have a separate failed channel, with the anonymous goroutine always accepting input from it. Like this (not tested):

    package main
    
    func worker(input chan string, output chan SiteData, failed chan string) {
        for url := range input {
            // ...
            if resp != nil && status == 200 {
                output <- processSiteData(resp)
            } else {
                failed <- url
            }
        }
    }
    
    func crawl(urlList []string) {
        numWorkers := 4
        input := make(chan string)
        failed := make(chan string)
        output := make(chan SiteData)
    
        // spawn workers
        for i := 0; i < numWorkers; i++ {
            go worker(input, output, failed)
        }
    
        // Dispatch URLs to the workers, also receive failures from them.
        go func() {
            for {
                if len(urlList) == 0 {
                    // nothing left to dispatch; block until a worker reports
                    // a failure so that urlList[0] below stays in bounds
                    urlList = append(urlList, <-failed)
                }
                select {
                case input <- urlList[0]:
                    urlList = urlList[1:]
                case url := <-failed:
                    urlList = append(urlList, url)
                }
            }
        }()
    
        // wait for the results
        for {
            data := <-output
            saveToDB(data)
        }
    }
    
    func main() {
        urlList := loadLinksFromDB()
        crawl(urlList)
    }
    

    (Note how it is correct, as you say in your commentary, not to use for _, _ = range urlList { // read output } in your crawl() function, because URLs can be re-enqueued; but you don’t need select either as far as I can tell.)

    This answer was accepted as the best answer by the asker.