drob50257447 2018-01-10 23:39

Golang web spider with pagination handling

I'm working on a Golang web crawler that should parse the search results of a specific search engine. The main difficulty is the concurrent parsing, or rather, handling pagination such as ← Previous 1 2 3 4 5 ... 34 Next →. Everything works fine except the recursive crawling of the paginated results. Look at my code:

package main

import (
    "bufio"
    "errors"
    "fmt"
    "net"
    "strings"

    "github.com/antchfx/htmlquery"
    "golang.org/x/net/html"
)

type Spider struct {
    HandledUrls []string
}

func NewSpider(url string) *Spider {
    // ...
}

func requestProvider(request string) string {
    // Everything is good here
}

func connectProvider(url string) net.Conn {
    // Also works
}

// getContents makes a request to the search engine and returns the response body
func getContents(request string) *html.Node {
    // ...
}

// checkResult reports whether the search returned any results
func checkResult(node *html.Node) bool {
    // ...
}

func (s *Spider) checkVisited(url string) bool {
    // ...
}

// Here is the problem
func (s *Spider) Crawl(url string, channelDone chan bool, channelBody chan *html.Node) {
    body := getContents(url)

    defer func() {
        channelDone <- true
    }()

    if !checkResult(body) {
        err := errors.New("Nothing found there")
        ErrFatal(err)
    }

    channelBody <- body
    s.HandledUrls = append(s.HandledUrls, url)
    fmt.Println("Handled ", url)

    newUrls := s.getPagination(body)

    for _, u := range newUrls {
        fmt.Println(u)
    }

    for i, newurl := range newUrls {
        if !s.checkVisited(newurl) {
            fmt.Println(i)
            go s.Crawl(newurl, channelDone, channelBody)
        }
    }
}

func (s *Spider) getPagination(node *html.Node) []string {
    // ...
}

func main() {
    request := requestProvider(*requestFlag) // requestFlag is declared in code omitted here
    channelBody := make(chan *html.Node, 120)
    channelDone := make(chan bool)

    var parsedHosts []*Host

    s := NewSpider(request)

    go s.Crawl(request, channelDone, channelBody)

    for {
        select {
        case receivedNode := <-channelBody:
            // ...

            for _, h := range newHosts {
                parsedHosts = append(parsedHosts, h)
                fmt.Println("added", h.HostUrl)
            }

        case <-channelDone:
            fmt.Println("Jobs finished")
        }

        break
    }
}

It always returns the first page only, with no pagination. getPagination(...) itself works fine. Please tell me where my error(s) are. I hope Google Translate was correct.


1 answer

  • donglu9978 2018-01-11 05:26

    The problem is probably that main exits before all goroutines have finished.

    First, there is a break after the select statement, and it runs unconditionally the first time a channel is read. That guarantees the main func returns after the first value you send over channelBody.

    Second, channelDone is not the right tool here. The most idiomatic approach would be a sync.WaitGroup: call wg.Add(1) before starting each goroutine and replace the deferred send with defer wg.Done(); in main, call wg.Wait(). Be aware that you must share a pointer to the WaitGroup rather than copying it. The sync package documentation covers the details.
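
    Here is a minimal, runnable sketch of that pattern. The fetch stub, the page names, and the markVisited helper are hypothetical stand-ins for your getContents/getPagination/checkVisited; the rest is the WaitGroup wiring described above. One extra note: your version appends to HandledUrls from several goroutines at once, which is a data race, so the sketch guards its visited set with a mutex.

    package main

    import (
        "fmt"
        "sync"
    )

    type Spider struct {
        mu      sync.Mutex
        visited map[string]bool
    }

    // markVisited records url and reports whether it was new. The mutex is
    // required because multiple Crawl goroutines call it concurrently.
    func (s *Spider) markVisited(url string) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.visited[url] {
            return false
        }
        s.visited[url] = true
        return true
    }

    // fetch is a hypothetical stand-in for getContents + getPagination:
    // it returns the page body and the pagination links found on it.
    func fetch(url string) (body string, next []string) {
        pages := map[string][]string{
            "page1": {"page2", "page3"},
            "page2": {"page3"},
            "page3": nil,
        }
        return "contents of " + url, pages[url]
    }

    func (s *Spider) Crawl(url string, wg *sync.WaitGroup, bodies chan<- string) {
        defer wg.Done() // replaces the send on channelDone

        body, next := fetch(url)
        bodies <- body
        fmt.Println("Handled", url)

        for _, u := range next {
            if s.markVisited(u) {
                wg.Add(1) // add before the goroutine starts, never inside it
                go s.Crawl(u, wg, bodies)
            }
        }
    }

    func main() {
        s := &Spider{visited: map[string]bool{}}
        bodies := make(chan string, 120)

        var wg sync.WaitGroup
        s.markVisited("page1")
        wg.Add(1)
        go s.Crawl("page1", &wg, bodies)

        // Close bodies once every Crawl goroutine has finished, so the
        // range below terminates by itself instead of breaking early.
        go func() {
            wg.Wait()
            close(bodies)
        }()

        for body := range bodies {
            fmt.Println("received:", body)
        }
        fmt.Println("Jobs finished")
    }

    Draining bodies with a plain range, and closing the channel only after wg.Wait returns, replaces both the select and the unconditional break: main now exits exactly when the last Crawl goroutine is done.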

