drtppp75155 2012-09-01 04:46
Viewed 45 times
Accepted

Exercise: Web Crawler - concurrency doesn't work

I am going through the Go tour and working on the final exercise: changing a web crawler to crawl in parallel without fetching the same URL twice (http://tour.golang.org/#73). All I have changed is the Crawl function.

    var used = make(map[string]bool)

    func Crawl(url string, depth int, fetcher Fetcher) {
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("\nfound: %s %q\n\n", url, body)
        for _, u := range urls {
            if used[u] == false {
                used[u] = true
                Crawl(u, depth-1, fetcher)
            }
        }
        return
    }

To make it concurrent I added the go keyword in front of the recursive call to Crawl, but instead of crawling recursively, the program only finds the "http://golang.org/" page and no other pages.

Why doesn't the program work when I add the go command to the call of the function Crawl?


2 answers

  • dpbz14739 2012-09-03 15:09

The problem seems to be that your process exits before the crawler can follow all of the URLs: because of the concurrency, main() returns before the worker goroutines are finished, and the runtime shuts down with it.

    To circumvent this, you could use sync.WaitGroup:

    func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
        defer wg.Done()
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("\nfound: %s %q\n\n", url, body)
        for _, u := range urls {
            if used[u] == false {
                used[u] = true
                wg.Add(1)
                go Crawl(u, depth-1, fetcher, wg)
            }
        }
        return
    }
    

    And call Crawl in main as follows:

    func main() {
        wg := &sync.WaitGroup{}

        // The initial Crawl call also runs wg.Done() via its defer,
        // so it must be accounted for here.
        wg.Add(1)
        Crawl("http://golang.org/", 4, fetcher, wg)

        wg.Wait()
    }
    

Also, don't rely on the map being thread safe: used is now read and written from multiple goroutines without synchronization, which is a data race.

This answer was accepted by the asker.
