Golang web spider with pagination handling

I'm working on a Go web crawler that parses the search results of a specific search engine. The main difficulty is concurrent parsing, or more precisely, handling pagination such as ← Previous 1 2 3 4 5 ... 34 Next →. Everything works except the recursive crawling of paginated results. Here is my code:

package main

import (
    "bufio"
    "errors"
    "fmt"
    "net"
    "strings"

    "github.com/antchfx/htmlquery"
    "golang.org/x/net/html"
)

type Spider struct {
    HandledUrls []string
}

func NewSpider(url string) *Spider {
    // ...
}

func requestProvider(request string) string {
    // Everything is good here
}

func connectProvider(url string) net.Conn {
    // Also works
}

// getContents makes request to search engine and gets response body
func getContents(request string) *html.Node {
    // ...
}

// checkResult reports whether the page contains any search results
func checkResult(node *html.Node) bool {
    // ...
}

func (s *Spider) checkVisited(url string) bool {
    // ...
}

// Here is the problem
func (s *Spider) Crawl(url string, channelDone chan bool, channelBody chan *html.Node) {
    body := getContents(url)

    defer func() {
        channelDone <- true
    }()

    if checkResult(body) == false {
        err := errors.New("Nothing found there")
        ErrFatal(err)
    }

    channelBody <- body
    s.HandledUrls = append(s.HandledUrls, url)
    fmt.Println("Handled ", url)

    newUrls := s.getPagination(body)

    for _, u := range newUrls {
        fmt.Println(u)
    }

    for i, newurl := range newUrls {
        if s.checkVisited(newurl) == false {
            fmt.Println(i)
            go s.Crawl(newurl, channelDone, channelBody)
        }
    }
}

func (s *Spider) getPagination(node *html.Node) []string {
    // ...
}

func main() {
    request := requestProvider(*requestFlag)
    channelBody := make(chan *html.Node, 120)
    channelDone := make(chan bool)

    var parsedHosts []*Host

    s := NewSpider(request)

    go s.Crawl(request, channelDone, channelBody)

    for {
        select {
        case recievedNode := <-channelBody:
            // ...

            for _, h := range newHosts {
                parsedHosts = append(parsedHosts, h)
                fmt.Println("added", h.HostUrl)
            }

        case <-channelDone:
            fmt.Println("Jobs finished")
        }

        break
    }
}

It always returns only the first page and never follows the pagination. getPagination(...) itself works fine. Please tell me where my mistake is. I hope Google Translate got this right.

1 answer
donglu9978 2018-01-11 05:26
The problem is probably that main exits before all goroutines have finished.

First, there is a break after the select statement, and it runs unconditionally after the first time a channel is read. This makes main return right after the first value is sent over channelBody.
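The effect of that break can be reproduced in isolation. The following is a minimal sketch, not the asker's program: the function names receiveWithBreak and receiveUntilClosed and the string payloads are made up for illustration. It puts three values in a buffered channel and compares a loop shaped like the one in the question with a loop that reads until the channel is closed.

```go
package main

import "fmt"

// receiveWithBreak mirrors the loop shape from the question: the break sits
// after the select, inside the for, so the loop ends after the first case
// that fires. (Hypothetical helper, for illustration only.)
func receiveWithBreak(bodies <-chan string, done <-chan bool) int {
	handled := 0
	for {
		select {
		case <-bodies:
			handled++
		case <-done:
		}
		break // runs unconditionally after the first select
	}
	return handled
}

// receiveUntilClosed keeps reading until the sender closes the channel.
func receiveUntilClosed(bodies <-chan string) int {
	handled := 0
	for range bodies {
		handled++
	}
	return handled
}

func main() {
	a := make(chan string, 3)
	a <- "page 1"
	a <- "page 2"
	a <- "page 3"
	fmt.Println("with break:", receiveWithBreak(a, make(chan bool)))

	b := make(chan string, 3)
	b <- "page 1"
	b <- "page 2"
	b <- "page 3"
	close(b)
	fmt.Println("until closed:", receiveUntilClosed(b))
}
```

The first loop handles a single value even though three are waiting, which matches the "first page only" symptom. The second loop handles all three, but it needs someone to close the channel once the work is done.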

Secondly, channelDone is not the right tool here. The most idiomatic approach is a sync.WaitGroup: call WG.Add(1) before starting each goroutine, replace the deferred send with defer WG.Done(), and call WG.Wait() in main. Be aware that the WaitGroup must be passed by pointer, because a copy would have its own counter. You can read more here.
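As a rough sketch of how the WaitGroup suggestion could look, here is a self-contained version of the crawl loop. Several things in it are assumptions and not part of the question: the fakePages map stands in for getContents and getPagination, page bodies are plain strings instead of *html.Node, the visited set is a mutex-protected map (the original appends to HandledUrls from several goroutines, which would be a data race), and CrawlAll is a made-up helper. WG.Wait() runs in a small helper goroutine that closes the channel, so that the caller can keep receiving until the crawl is complete.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fakePages is hypothetical data standing in for the search engine:
// each page URL maps to the pagination links found on that page.
var fakePages = map[string][]string{
	"/search?p=1": {"/search?p=2", "/search?p=3"},
	"/search?p=2": {"/search?p=1", "/search?p=3"},
	"/search?p=3": {"/search?p=1", "/search?p=2", "/search?p=4"},
	"/search?p=4": {"/search?p=3"},
}

// Spider tracks visited URLs; the mutex makes the check-and-mark step safe
// when many Crawl goroutines run at once.
type Spider struct {
	mu      sync.Mutex
	visited map[string]bool
}

// markVisited records url and reports whether it was new.
func (s *Spider) markVisited(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.visited[url] {
		return false
	}
	s.visited[url] = true
	return true
}

// Crawl sends the page to bodies and starts one goroutine per unseen link.
// The WaitGroup is passed by pointer so all goroutines share one counter.
func (s *Spider) Crawl(url string, wg *sync.WaitGroup, bodies chan<- string) {
	defer wg.Done()
	bodies <- url // stand-in for the parsed page body
	for _, next := range fakePages[url] {
		if s.markVisited(next) {
			wg.Add(1) // must happen before the goroutine starts
			go s.Crawl(next, wg, bodies)
		}
	}
}

// CrawlAll runs the crawl from start and returns every page seen, sorted.
func CrawlAll(start string) []string {
	s := &Spider{visited: map[string]bool{start: true}}
	bodies := make(chan string)
	var wg sync.WaitGroup

	wg.Add(1)
	go s.Crawl(start, &wg, bodies)

	// Close bodies once every Crawl goroutine has called Done.
	go func() {
		wg.Wait()
		close(bodies)
	}()

	var pages []string
	for p := range bodies { // ends when bodies is closed
		pages = append(pages, p)
	}
	sort.Strings(pages)
	return pages
}

func main() {
	fmt.Println(CrawlAll("/search?p=1"))
}
```

Because each Crawl calls Add for its children before its own deferred Done runs, the counter cannot reach zero while work is still pending. Calling Wait directly in main would also work if bodies were buffered large enough to hold every page, but with an unbuffered channel the receiver has to keep draining while the goroutines run.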

This answer was selected by the asker as the best answer.
