Tour of Go exercise: Web Crawler - all goroutines are asleep - deadlock!

Exercise from: https://tour.golang.org/concurrency/10

Description:

In this exercise you'll use Go's concurrency features to parallelize a web crawler.

Modify the Crawl function to fetch URLs in parallel without fetching the same URL twice.

Hint: you can keep a cache of the URLs that have been fetched on a map, but maps alone are not safe for concurrent use!

Here's my answer:

package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

var crawledURLs = make(map[string]bool)
var mux sync.Mutex

func CrawlURL(url string, depth int, fetcher Fetcher, quit chan bool) {
    defer func() { quit <- true }()
    if depth <= 0 {
        return
    }

    mux.Lock()
    _, isCrawled := crawledURLs[url]
    if isCrawled {
        return
    }
    crawledURLs[url] = true
    mux.Unlock()

    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    quitThis := make(chan bool)
    for _, u := range urls {
        go CrawlURL(u, depth-1, fetcher, quitThis)
    }
    for range urls {
        <-quitThis
    }
    return
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
    CrawlURL(url, depth, fetcher, make(chan bool))
    return
}

func main() {
    Crawl("https://golang.org/", 4, fetcher)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "https://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "https://golang.org/pkg/",
            "https://golang.org/cmd/",
        },
    },
    "https://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "https://golang.org/",
            "https://golang.org/cmd/",
            "https://golang.org/pkg/fmt/",
            "https://golang.org/pkg/os/",
        },
    },
    "https://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "https://golang.org/",
            "https://golang.org/pkg/",
        },
    },
    "https://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "https://golang.org/",
            "https://golang.org/pkg/",
        },
    },
}

And the output:

found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
fatal error: all goroutines are asleep - deadlock!

Why does the deadlock happen? Am I using channels the wrong way?

I noticed that I forgot to release the mutex in the if isCrawled {} branch, so I edited my code like this:

...
    if isCrawled {
        mux.Unlock() // added this line
        return
    }
...

But the deadlock still exists, and the output is different:

found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/pkg/fmt/ "Package fmt"
fatal error: all goroutines are asleep - deadlock!

1 answer

dtxf759200 2019-07-04 07:58

The main issue is that you forgot to release the mutex before returning in the if isCrawled {} branch.
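One way to make a missed unlock harder to write is to move the check-and-mark step into a small helper that uses defer. The sketch below is only an illustration: visitedSet, newVisitedSet and markVisited are hypothetical names that do not appear in the original code.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet is a hypothetical helper (not part of the original code):
// a set of URLs guarded by a mutex.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]bool)}
}

// markVisited records url and reports whether this call was the first
// one to see it. The deferred Unlock runs on every return path, so no
// branch can leave the mutex locked.
func (v *visitedSet) markVisited(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

func main() {
	v := newVisitedSet()
	fmt.Println(v.markVisited("https://golang.org/")) // true
	fmt.Println(v.markVisited("https://golang.org/")) // false
}
```

With a helper like this, CrawlURL could return early when markVisited(url) reports false, without touching the mutex directly.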

Moreover, I would suggest using the synchronization APIs when you actually need to synchronize goroutines. Channels are better suited to communicating and sharing data.

This is the solution with sync.WaitGroup: https://play.golang.org/p/slrnmr3sPrs
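For readers who cannot open the playground link, here is a minimal sketch of what a sync.WaitGroup version could look like. It is not the code behind the link: the crawler type, its field names, the reduced fakeFetcher, and the fact that Crawl returns the visited set are all assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

type Fetcher interface {
	Fetch(url string) (body string, urls []string, err error)
}

// crawler bundles the shared state. The type and its field names are
// assumptions made for this sketch.
type crawler struct {
	mu      sync.Mutex
	visited map[string]bool
	wg      sync.WaitGroup
	fetcher Fetcher
}

func (c *crawler) crawl(url string, depth int) {
	defer c.wg.Done() // runs on every return path
	if depth <= 0 {
		return
	}
	c.mu.Lock()
	if c.visited[url] {
		c.mu.Unlock()
		return
	}
	c.visited[url] = true
	c.mu.Unlock()

	body, urls, err := c.fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		c.wg.Add(1) // Add before the goroutine starts, never inside it
		go c.crawl(u, depth-1)
	}
}

// Crawl blocks until every goroutine has finished. Returning the visited
// set is an addition for this sketch; the exercise's Crawl returns nothing.
func Crawl(url string, depth int, fetcher Fetcher) map[string]bool {
	c := &crawler{visited: make(map[string]bool), fetcher: fetcher}
	c.wg.Add(1)
	go c.crawl(url, depth)
	c.wg.Wait()
	return c.visited
}

// fakeFetcher is a reduced stand-in for the exercise's fetcher.
type fakeFetcher map[string][]string

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if urls, ok := f[url]; ok {
		return "body of " + url, urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

func main() {
	f := fakeFetcher{
		"a": {"b", "c"},
		"b": {"a", "c"},
	}
	visited := Crawl("a", 4, f)
	fmt.Println(len(visited)) // 3: a, b and c are all marked as visited
}
```

No goroutine waits for its children here. Only Crawl waits, once, for the whole tree, so there is no per-call channel that someone has to remember to receive from.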

Here instead is your solution using only channels: https://play.golang.org/p/FbPXxPSXvFL

The problem was that nothing reads from the channel you pass as an argument to the very first call of CrawlURL(). The channel is unbuffered, so once that function tries to send on it through defer func() { quit <- true }(), it blocks forever and never returns.
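The blocking behaviour can be seen in isolation with a small sketch. The names worker and trySend are hypothetical helpers for this illustration. trySend uses select with a default case to report whether a send could proceed immediately, which is only a way to observe the channel's state, not something the crawler itself needs.

```go
package main

import "fmt"

// worker mirrors the shape of CrawlURL: it signals completion on done
// when it returns.
func worker(done chan bool) {
	defer func() { done <- true }()
}

// trySend reports whether a send on ch can proceed right now. The default
// case is taken when the send would block.
func trySend(ch chan bool) bool {
	select {
	case ch <- true:
		return true
	default:
		return false
	}
}

func main() {
	// With no receiver waiting, a send on an unbuffered channel cannot
	// proceed. A plain `ch <- true` here would block forever.
	fmt.Println(trySend(make(chan bool))) // false

	// A buffered channel with free capacity accepts the send.
	fmt.Println(trySend(make(chan bool, 1))) // true

	// Running the worker in its own goroutine and receiving from the
	// channel lets its deferred send complete.
	done := make(chan bool)
	go worker(done)
	<-done
	fmt.Println("worker finished")
}
```

Calling worker(done) directly from main, without the go keyword and without a receiver, would reproduce the "all goroutines are asleep" error from the question.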

package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

var crawledURLs = make(map[string]bool)
var mux sync.Mutex

func CrawlURL(url string, depth int, fetcher Fetcher, quit chan bool) {
    // For the very first call, this send blocks forever if
    // nobody is receiving from the other end of this channel.
    defer func() { quit <- true }()

    if depth <= 0 {
        return
    }

    mux.Lock()
    _, isCrawled := crawledURLs[url]
    if isCrawled {
        mux.Unlock()
        return
    }
    crawledURLs[url] = true
    mux.Unlock()

    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    quitThis := make(chan bool)
    for _, u := range urls {
        go CrawlURL(u, depth-1, fetcher, quitThis)
    }
    for range urls {
        <-quitThis
    }
    return
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
    lastQuit := make(chan bool)
    go CrawlURL(url, depth, fetcher, lastQuit)
    //You need to receive from this channel in order to
    //unblock the called function
    <-lastQuit
    return
}

func main() {
    Crawl("https://golang.org/", 10, fetcher)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "https://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "https://golang.org/pkg/",
            "https://golang.org/cmd/",
        },
    },
    "https://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "https://golang.org/",
            "https://golang.org/cmd/",
            "https://golang.org/pkg/fmt/",
            "https://golang.org/pkg/os/",
        },
    },
    "https://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "https://golang.org/",
            "https://golang.org/pkg/",
        },
    },
    "https://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "https://golang.org/",
            "https://golang.org/pkg/",
        },
    },
}

This answer was accepted by the asker.
