When and where should I check whether a channel won't get any more data?

I'm trying to solve Exercise: Web Crawler

In this exercise you'll use Go's concurrency features to parallelize a web crawler.

Modify the Crawl function to fetch URLs in parallel without fetching the same URL twice.

When should I check whether all URLs have already been crawled? (Or, how can I know that no more data will be queued?)

package main

import (
    "fmt"
)

type Result struct {
    Url string
    Depth int
}

type Stor struct {
    Queue  chan Result
    Visited map[string]int
}    

func NewStor() *Stor {
    return &Stor{
        Queue:  make(chan Result,1000),
        Visited: map[string]int{},
    }
}

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(res Result, fetcher Fetcher, stor *Stor) {
    defer func() {          
        /*
        if len(stor.Queue) == 0 {
            close(stor.Queue)
        }   
        */  // this is wrong, it makes the channel closes too early
    }()
    if res.Depth <= 0 {
        return
    }
    // TODO: Don't fetch the same URL twice.
    url := res.Url
    stor.Visited[url]++
    if stor.Visited[url] > 1 {
        fmt.Println("skip:",stor.Visited[url],url)
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }   
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        stor.Queue <- Result{u,res.Depth-1}
    }
    return
}

func main() {
    stor := NewStor()   
    Crawl(Result{"http://golang.org/", 4}, fetcher, stor)
    for res := range stor.Queue {
        // TODO: Fetch URLs in parallel.
        go Crawl(res,fetcher,stor)
    }
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}

The program ends in a deadlock because the stor.Queue channel is never closed.

duanji5746 2015-01-09 13:55


The simplest way to wait until all goroutines are done is sync.WaitGroup from the sync package:

package main

import "sync"

var wg sync.WaitGroup

// Then, in Crawl:
// (Why pass stor *Stor as an argument? A package-level variable is
// visible to all goroutines anyway.)
func Crawl(res Result, fetcher Fetcher) {
    defer wg.Done()
...
    // Why not spawn the new goroutines right inside Crawl?
    for _, u := range urls {
        wg.Add(1)
        go Crawl(Result{u, res.Depth - 1}, fetcher)
    }
...
}
...
// And in main:
func main() {
    wg.Add(1)
    Crawl(Result{"http://golang.org/", 4}, fetcher)
    ...
    wg.Wait() // blocks until every goroutine has called Done
}
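
One detail the sketch above relies on: wg.Add(1) is called before the go statement, by the goroutine that is still counted itself. That way the counter can never drop to zero while a child is about to start. Here is a minimal, self-contained illustration; countDown and runChain are hypothetical names made up for this example and are not part of the exercise.

```go
package main

import (
	"fmt"
	"sync"
)

// countDown records one hit, then starts a child goroutine for n-1.
// The child is registered with wg.Add(1) BEFORE it is started, while the
// parent has not yet called Done, so wg.Wait cannot return too early.
func countDown(n int, wg *sync.WaitGroup, mu *sync.Mutex, hits *int) {
	defer wg.Done()

	mu.Lock()
	*hits++
	mu.Unlock()

	if n == 0 {
		return
	}
	wg.Add(1) // register the child first ...
	go countDown(n-1, wg, mu, hits) // ... then start it
}

// runChain starts a chain of n+1 goroutines and waits for all of them.
func runChain(n int) int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		hits int
	)
	wg.Add(1)
	go countDown(n, &wg, &mu, &hits)
	wg.Wait() // returns only after every goroutine in the chain called Done
	return hits
}

func main() {
	fmt.Println(runChain(5)) // prints 6: one hit each for n = 5, 4, 3, 2, 1, 0
}
```

If wg.Add(1) were moved inside the child goroutine instead, the parent could call Done and let Wait return before the child had registered itself.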

The complete solution:

package main

import (
    "fmt"
    "sync"
)

var wg sync.WaitGroup

// visited counts how often each URL was requested. mu guards it, because
// Crawl runs in many goroutines and Go maps are not safe for concurrent writes.
var (
    mu      sync.Mutex
    visited = map[string]int{}
)

type Result struct {
    Url   string
    Depth int
}

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(res Result, fetcher Fetcher) {
    defer wg.Done()
    if res.Depth <= 0 {
        return
    }
    // Don't fetch the same URL twice.
    url := res.Url
    mu.Lock()
    visited[url]++
    count := visited[url]
    mu.Unlock()
    if count > 1 {
        fmt.Println("skip:", count, url)
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        // Register the new goroutine before starting it.
        wg.Add(1)
        go Crawl(Result{u, res.Depth - 1}, fetcher)
        // replaces: stor.Queue <- Result{u, res.Depth - 1}
    }
}

func main() {
    wg.Add(1) 
    Crawl(Result{"http://golang.org/", 4}, fetcher)
    wg.Wait()
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}
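
If you would rather keep the queue channel from the question, the same WaitGroup also answers "when can the channel be closed": count every job as pending before it is sent, and let one extra goroutine close the channel once the count reaches zero. The following is only a sketch under assumptions. The job type, the links map (a stand-in for fakeFetcher) and crawlAll are hypothetical names for this example, not part of the exercise.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// job is one URL to crawl plus the remaining depth budget.
type job struct {
	url   string
	depth int
}

// links is a hypothetical stand-in for the fake fetcher: page -> outgoing links.
var links = map[string][]string{
	"a": {"b", "c"},
	"b": {"a", "c", "d"},
	"d": {"a"},
}

// crawlAll returns, sorted, every URL visited from start within depth steps.
func crawlAll(start string, depth int) []string {
	var (
		mu      sync.Mutex
		visited = map[string]bool{}
		pending sync.WaitGroup
		queue   = make(chan job)
	)

	// enqueue counts the job as pending BEFORE sending it, so the counter
	// cannot reach zero while a job is still on its way into the queue.
	enqueue := func(j job) {
		pending.Add(1)
		go func() { queue <- j }()
	}
	enqueue(job{start, depth})

	// Close the queue once every enqueued job has been fully processed.
	// This is what ends the range loop below.
	go func() {
		pending.Wait()
		close(queue)
	}()

	for j := range queue {
		go func(j job) {
			defer pending.Done() // runs after the child jobs were enqueued
			if j.depth <= 0 {
				return
			}
			mu.Lock()
			seen := visited[j.url]
			visited[j.url] = true
			mu.Unlock()
			if seen {
				return
			}
			for _, u := range links[j.url] {
				enqueue(job{u, j.depth - 1})
			}
		}(j)
	}

	// The queue is closed, so all workers are done and visited is stable.
	out := make([]string, 0, len(visited))
	for u := range visited {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(crawlAll("a", 4)) // prints [a b c d]
}
```

Checking len(queue) == 0, as in the commented-out defer in the question, cannot work because an empty queue says nothing about workers that are still running and may add more URLs. The pending counter tracks exactly that.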

This answer was selected by the asker as the best answer.

Question asked: 2015-01-08 13:18
