dongqufi82315 2017-12-31 13:47
61 views

Web crawler: same code, different output on each run

I've been working through all of the Tour of Go tutorials and I'm stuck on the web crawler exercise. I thought I had finished it, but the output is inconsistent, and I don't have enough concurrency experience to figure out why.

Here's my code:

package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}
var cache = struct {
    fetched map[string]bool
    sync.Mutex
}{fetched: make(map[string]bool)}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, c chan []string, quit chan int) {
    if depth <= 0 {
        return
    }
    go safeVisit(url, c, quit, fetcher)
    for {
        select {
        case <- quit:
            return
        case u := <-c:
            for _, v := range u {
                go Crawl(v, depth-1, fetcher, c, quit)
            }
        }
    }
}
func main() {
    c := make(chan []string)
    quit := make(chan int)
    Crawl("http://golang.org/", 4, fetcher, c, quit)
}

func safeVisit(url string, c chan []string, quit chan int, fetcher Fetcher) {
    cache.Lock()
    defer cache.Unlock()
    if _, ok := cache.fetched[url]; ok {
        quit <- 0
        return
    }
    body, urls, err := fetcher.Fetch(url)
    cache.fetched[url] = true
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("Visited : %s, %q \n", url, body)
    c <- urls

}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}

Here's some sample output:

Visited : http://golang.org/, "The Go Programming Language" 
not found: http://golang.org/cmd/
Visited : http://golang.org/pkg/, "Packages" 
Visited : http://golang.org/pkg/os/, "Package os" 
**Visited : http://golang.org/pkg/fmt/, "Package fmt"** 

Process finished with exit code 0

In other runs the last package is missing (its line is marked with asterisks in the first sample above):

Visited : http://golang.org/, "The Go Programming Language" 
not found: http://golang.org/cmd/
Visited : http://golang.org/pkg/, "Packages" 
Visited : http://golang.org/pkg/os/, "Package os"

And finally, even a deadlock in some runs:

Visited : http://golang.org/, "The Go Programming Language" 
not found: http://golang.org/cmd/
Visited : http://golang.org/pkg/, "Packages" 
Visited : http://golang.org/pkg/os/, "Package os" 
Visited : http://golang.org/pkg/fmt/, "Package fmt" 
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [select]:
main.Crawl(0x4bfdf9, 0x12, 0x4, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
    /home/kostas/development/challenges/go/helloWorld.go:26 +0x201
main.main()
    /home/kostas/development/challenges/go/helloWorld.go:39 +0xab

goroutine 23 [select]:
main.Crawl(0x4bfdf9, 0x12, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
    /home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
    /home/kostas/development/challenges/go/helloWorld.go:31 +0x123

goroutine 24 [select]:
main.Crawl(0x4c09f9, 0x16, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
    /home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
    /home/kostas/development/challenges/go/helloWorld.go:31 +0x123

goroutine 5 [select]:
main.Crawl(0x4bfdf9, 0x12, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
    /home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
    /home/kostas/development/challenges/go/helloWorld.go:31 +0x123

goroutine 6 [select]:
main.Crawl(0x4c0a0f, 0x16, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
    /home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
    /home/kostas/development/challenges/go/helloWorld.go:31 +0x123

I assume it has something to do with concurrency and recursion. I've seen other solutions on GitHub that use sync.WaitGroup and the like, but that hasn't been covered in the Tour of Go so far, so I'd rather not use it yet.

UPDATE

I figured out what is going on and I'm working on the issue. Basically, the select statement sometimes gets stuck in an endless loop because the quit and c channels don't always fire in the order I expected. I added a default case that prints "nothing to do", and the program sometimes looped forever and sometimes happened to finish correctly. My exit condition is not right.
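Here's a sketch of the direction I'm trying instead (all the names here are mine, not from the tour): keep the visited set and a counter of in-flight fetches inside one coordinating loop, so the loop itself knows when nothing is left and there is no quit channel to mis-deliver:

```go
package main

import (
	"fmt"
	"sort"
)

// A minimal in-memory fetcher, standing in for the tour's fakeFetcher.
type fetcher map[string][]string

func (f fetcher) Fetch(url string) ([]string, error) {
	if urls, ok := f[url]; ok {
		return urls, nil
	}
	return nil, fmt.Errorf("not found: %s", url)
}

type result struct {
	urls  []string
	depth int
}

// crawl keeps all bookkeeping (visited set, pending counter) in one
// goroutine, so no mutex and no quit channel are needed. It returns
// the sorted list of URLs it actually fetched.
func crawl(start string, depth int, f fetcher) []string {
	results := make(chan result)
	visited := map[string]bool{start: true}
	pending := 1 // fetches currently in flight

	fetch := func(url string, d int) {
		urls, err := f.Fetch(url)
		if err != nil {
			fmt.Println(err)
		}
		results <- result{urls, d}
	}
	go fetch(start, depth)

	for pending > 0 {
		r := <-results
		pending--
		if r.depth <= 1 {
			continue // don't expand links below the depth limit
		}
		for _, u := range r.urls {
			if !visited[u] {
				visited[u] = true
				pending++
				go fetch(u, r.depth-1)
			}
		}
	}

	out := make([]string, 0, len(visited))
	for u := range visited {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

func main() {
	f := fetcher{
		"a": {"b", "c"},
		"b": {"a", "c"},
		"c": {"a"},
	}
	fmt.Println(crawl("a", 4, f))
}
```

With a single owner for the counter and the map, the loop exits exactly when the last outstanding fetch has reported back, so there is no race on the exit condition.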


1 answer

  • dtdfj08626 2017-12-31 15:10

    I think the cause is quite clear: your channels are getting crossed. Multiple goroutines receive from the same channel, and the Go runtime just picks one of them at random.

    When you send a zero on quit, you never know which goroutine will quit: it is picked at random by the Go scheduler. It is entirely possible that a newly spawned Crawl receives from quit before it receives from c (even when both channels are ready).

    Because of that, the depth bookkeeping becomes a mess, the number of safeVisit calls varies from run to run, and quit ends up carrying a different (random) number of signals. Sometimes that is simply not enough to stop all the goroutines that were spawned, and you get a deadlock.
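A tiny runnable demo of that delivery behaviour (the names here are made up for illustration): one send on an unbuffered channel wakes exactly one of the goroutines blocked on it, and the rest stay blocked:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// quitDeliveries starts n goroutines that all block receiving on the
// same unbuffered channel, sends a single value, and reports how many
// goroutines actually received it.
func quitDeliveries(n int) int {
	quit := make(chan int)
	var got int32
	for i := 0; i < n; i++ {
		go func() {
			<-quit // only one of the n receivers will get the value
			atomic.AddInt32(&got, 1)
		}()
	}
	quit <- 0                          // unblocks exactly one receiver
	time.Sleep(100 * time.Millisecond) // give it time to bump the counter
	return int(atomic.LoadInt32(&got))
}

func main() {
	fmt.Println(quitDeliveries(5)) // prints 1: one send, one delivery
}
```

So each zero you send on quit stops at most one of the Crawl loops; unless the number of sends exactly matches the number of loops, some of them stay alive, or everyone ends up blocked, which is your deadlock.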

    Edit:

    First you should understand what your task is. The Crawl function takes a url, a depth, and a fetcher, and it:

    1. Fetches the url
    2. Prints the fetched body
    3. Spawns new Crawl calls for the URLs found on that page, with depth-1

    Though the tour asks you to fetch URLs in parallel, it is clear that steps 2 and 3 must happen after step 1, meaning it is normal for a single Crawl to wait for its own fetch. In other words, there is no need for a separate goroutine just to call Fetch.

    And in step 3, each new Crawl call has no need to wait for the previous one to finish, so these calls should run in parallel.

    With that analysis, one arrives at this code:

    func Crawl(url string, depth int, fetcher Fetcher) {
        // TODO: Fetch URLs in parallel.
        // TODO: Don't fetch the same URL twice.
        // This implementation doesn't do either:
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        for _, u := range urls {
            go Crawl(u, depth-1, fetcher)
        }
        return
    }
    

    There is one more problem: dealing with already-visited URLs. You have done that part well; instead of sending on quit, just turn the check into a func(string) bool and call it directly: if Visited(url) { return } and it is done.
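A minimal sketch of that helper, reusing the mutex-guarded map from your code (the name visited is just a suggestion):

```go
package main

import (
	"fmt"
	"sync"
)

// The same mutex-guarded cache as in the question.
var cache = struct {
	sync.Mutex
	fetched map[string]bool
}{fetched: make(map[string]bool)}

// visited reports whether url was already seen, and marks it as seen.
// The caller simply does: if visited(url) { return }
func visited(url string) bool {
	cache.Lock()
	defer cache.Unlock()
	if cache.fetched[url] {
		return true
	}
	cache.fetched[url] = true
	return false
}

func main() {
	fmt.Println(visited("http://golang.org/")) // false: first visit
	fmt.Println(visited("http://golang.org/")) // true: already seen
}
```

Because the check and the mark happen under a single lock, two goroutines can never both see false for the same URL.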

    A side note: the tour is really not good at teaching concurrency. You may want to look at the Go blog articles, such as Go Concurrency Patterns or Share Memory by Communicating.

