dongli8862 2017-03-29 10:29

Web crawler gets stuck

I'm new to Go and trying to implement a web crawler. It should asynchronously parse web pages and save their contents to files, one file per new page. But it gets stuck after I added the following lines:

u, _ := url.Parse(uri)
fileName := u.Host + u.RawQuery + ".html"
body, err := ioutil.ReadAll(resp.Body)
writes <- writer{fileName: fileName, body: body}

Can anyone help me fix this problem? Basically I want to get data from the response body, push it to the channel, and then get data from the channel and put it into a file. It looks like the writes channel was not initialized, and sending on a nil channel blocks forever.

package main

import (
    "crypto/tls"
    "flag"
    "fmt"
    "io/ioutil"
    "net/http"
    "net/url"
    "os"
    "runtime"

    "./linksCollector"
)

type writer struct {
    fileName string
    body     []byte
}

var writes = make(chan writer)

func usage() {
    fmt.Fprintf(os.Stderr, "usage: crawl http://example.com/")
    flag.PrintDefaults()
    os.Exit(2)
}

func check(e error) {
    if e != nil {
        panic(e)
    }
}

func main() {
    runtime.GOMAXPROCS(8)
    flag.Usage = usage
    flag.Parse()

    args := flag.Args()
    fmt.Println(args)
    if len(args) < 1 {
        usage()
        fmt.Println("Please specify start page")
        os.Exit(1)
    }

    queue := make(chan string)
    filteredQueue := make(chan string)

    go func() { queue <- args[0] }()
    go filterQueue(queue, filteredQueue)

    for uri := range filteredQueue {
        go enqueue(uri, queue)
    }

    for {
        select {
        case data := <-writes:
            f, err := os.Create(data.fileName)
            check(err)
            defer f.Close()
            _, err = f.Write(data.body)
            check(err)
        }
    }
}

func filterQueue(in chan string, out chan string) {
    var seen = make(map[string]bool)
    for val := range in {
        if !seen[val] {
            seen[val] = true
            out <- val
        }
    }
}

func enqueue(uri string, queue chan string) {
    fmt.Println("fetching", uri)
    transport := &http.Transport{
        TLSClientConfig: &tls.Config{
            InsecureSkipVerify: true,
        },
    }
    client := http.Client{Transport: transport}
    resp, err := client.Get(uri)
    check(err)

    defer resp.Body.Close()

    u, _ := url.Parse(uri)
    fileName := u.Host + u.RawQuery + ".html"
    body, err := ioutil.ReadAll(resp.Body)
    writes <- writer{fileName: fileName, body: body}

    links := collectlinks.All(resp.Body)

    for _, link := range links {
        absolute := fixURL(link, uri)
        if uri != "" {
            go func() { queue <- absolute }()
        }
    }
}

func fixURL(href, base string) string {
    uri, err := url.Parse(href)
    if err != nil {
        return ""
    }
    baseURL, err := url.Parse(base)
    if err != nil {
        return ""
    }
    uri = baseURL.ResolveReference(uri)
    return uri.String()
}

1 answer

  • dongruidian3064 2017-03-29 11:06

    I think your for loop ends up calling go enqueue more than once before the select receives the data, which causes the send to writes to crash the program. I'm not really that familiar with Go's concurrency, though.

    Update: I'm sorry for the previous answer; it was a poorly informed attempt at explaining something I have only limited knowledge about. After taking a closer look I am almost certain of two things. 1. Your writes channel is not nil; you can rely on make to initialize your channels. 2. A range loop over a channel keeps receiving, blocking whenever no value is available, until that channel is closed. So your

    for uri := range filteredQueue {
        go enqueue(uri, queue)
    }
    

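    is blocking, and since nothing ever closes filteredQueue it never finishes. Your program therefore never reaches the select and is unable to receive from the writes channel, so every send to writes in enqueue waits forever. You can avoid this by executing the range loop in a new goroutine.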

    go func() {
        for uri := range filteredQueue {
            go enqueue(uri, queue)
        }
    }()
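
To see the effect in isolation, here is a minimal sketch that is not taken from the question's code. The names collect, work and results are hypothetical stand-ins for the crawler's queue and writes channels, and the "fetched" string merely simulates a download. The range loop runs in its own goroutine, so the calling goroutine is free to receive what that loop sends.

```go
package main

import "fmt"

// collect forwards every item through two unbuffered channels and returns
// what arrives at the end. All names here are hypothetical.
func collect(items []string) []string {
	work := make(chan string)
	results := make(chan string)

	// Producer: send every item, then close work so the range below can end.
	go func() {
		for _, it := range items {
			work <- it
		}
		close(work)
	}()

	// The range loop runs in its own goroutine. If it ran inline here, the
	// send to results would block forever, because the receive loop further
	// down would never be reached.
	go func() {
		for it := range work {
			results <- "fetched " + it
		}
		close(results)
	}()

	// The calling goroutine is free to receive.
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	for _, line := range collect([]string{"a", "b", "c"}) {
		fmt.Println(line)
	}
}
```

Moving the second goroutine's loop body back into collect itself should reproduce the hang from the question: the Go runtime reports that all goroutines are asleep.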
    

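    Your program, as is, will still break for other reasons. For example, nothing tells main when all the enqueue goroutines have finished, and enqueue drains resp.Body with ioutil.ReadAll before handing it to collectlinks.All, which then has nothing left to read. You should be able to fix the first problem with a little synchronization using a sync.WaitGroup. Here's a simplified example: https://play.golang.org/p/o2Oj4g8c2y.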
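
One way the sync.WaitGroup idea might look is sketched below. This is an illustration under stated assumptions, not the linked playground example: page stands in for the question's writer struct, and fetchAll and fakeFetch are hypothetical names, with fakeFetch replacing the real HTTP request. Each worker registers with the WaitGroup before it starts, and a separate goroutine closes the channel once all workers are done, so the receiving loop can terminate.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// page stands in for the question's writer struct.
type page struct {
	fileName string
	body     []byte
}

// fetchAll starts one goroutine per URI and returns every page produced.
// fakeFetch is a hypothetical stand-in for the HTTP request.
func fetchAll(uris []string, fakeFetch func(string) []byte) []page {
	writes := make(chan page)
	var wg sync.WaitGroup

	for _, uri := range uris {
		wg.Add(1) // register the worker before its goroutine starts
		go func(uri string) {
			defer wg.Done()
			writes <- page{fileName: uri + ".html", body: fakeFetch(uri)}
		}(uri)
	}

	// Close writes once every worker has finished, so the range below ends.
	go func() {
		wg.Wait()
		close(writes)
	}()

	var pages []page
	for p := range writes {
		pages = append(pages, p)
	}
	return pages
}

func main() {
	pages := fetchAll([]string{"a", "b"}, func(uri string) []byte {
		return []byte("<html>" + uri + "</html>")
	})
	// Workers finish in no particular order, so sort before printing.
	sort.Slice(pages, func(i, j int) bool { return pages[i].fileName < pages[j].fileName })
	for _, p := range pages {
		fmt.Println(p.fileName, len(p.body))
	}
}
```

A real crawler discovers new links while it runs, so wg.Add would have to be called for each newly queued URL before the URL that found it calls wg.Done. The sketch leaves that part out.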
