drob50257447 2018-01-10 23:39
20 views
Accepted

Golang web spider with pagination handling

I'm working on a Golang web crawler that should parse the search results of a specific search engine. The main difficulty is crawling concurrently, or rather, handling pagination such as ← Previous 1 2 3 4 5 ... 34 Next →. Everything works fine except the recursive crawling of the paginated results. Here is my code:

package main

import (
    "bufio"
    "errors"
    "fmt"
    "net"
    "strings"

    "github.com/antchfx/htmlquery"
    "golang.org/x/net/html"
)

type Spider struct {
    HandledUrls []string
}

func NewSpider(url string) *Spider {
    // ...
}

func requestProvider(request string) string {
    // Everything is good here
}

func connectProvider(url string) net.Conn {
    // Also works
}

// getContents makes a request to the search engine and returns the parsed response body
func getContents(request string) *html.Node {
    // ...
}

// checkResult reports whether the response contains any search results
func checkResult(node *html.Node) bool {
    // ...
}

func (s *Spider) checkVisited(url string) bool {
    // ...
}

// Here is the problem
func (s *Spider) Crawl(url string, channelDone chan bool, channelBody chan *html.Node) {
    body := getContents(url)

    defer func() {
        channelDone <- true
    }()

    if checkResult(body) == false {
        err := errors.New("Nothing found there")
        ErrFatal(err)
    }

    channelBody <- body
    s.HandledUrls = append(s.HandledUrls, url)
    fmt.Println("Handled ", url)

    newUrls := s.getPagination(body)

    for _, u := range newUrls {
        fmt.Println(u)
    }

    for i, newurl := range newUrls {
        if s.checkVisited(newurl) == false {
            fmt.Println(i)
            go s.Crawl(newurl, channelDone, channelBody)
        }
    }
}

func (s *Spider) getPagination(node *html.Node) []string {
    // ...
}

func main() {
    request := requestProvider(*requestFlag)
    channelBody := make(chan *html.Node, 120)
    channelDone := make(chan bool)

    var parsedHosts []*Host

    s := NewSpider(request)

    go s.Crawl(request, channelDone, channelBody)

    for {
        select {
        case recievedNode := <-channelBody:
             // ...

            for _, h := range newHosts {
                 parsedHosts = append(parsedHosts, h)
                 fmt.Println("added", h.HostUrl)
            }

        case <-channelDone:
            fmt.Println("Jobs finished")
        }

        break
   }
}

It always handles the first page only and never follows the pagination. The same getPagination(...) works fine on its own. Please tell me where my error(s) are. I hope Google Translate was correct.


1 answer

  • donglu9978 2018-01-11 05:26

    The problem is probably that main exits before all the goroutines have finished.

    First, there is a break after the select statement, and it runs unconditionally after the first time a channel is read. That guarantees the main func returns after the first value you send over channelBody.
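
    For illustration (this is not your code, just the same loop shape boiled down), the unconditional break ends the for loop after a single receive:

    package main

    import "fmt"

    func main() {
        ch := make(chan int, 3)
        ch <- 1
        ch <- 2
        ch <- 3

        for {
            select {
            case v := <-ch:
                fmt.Println("received", v)
            }
            break // runs unconditionally, so the loop exits after the first receive
        }
        // Prints only "received 1"; the remaining values are never read.
    }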

    Second, using channelDone is not the right way here. The most idiomatic approach would be a sync.WaitGroup: before starting each goroutine, call wg.Add(1), replace the defer in Crawl with defer wg.Done(), and in main call wg.Wait(). Please be aware that you should pass a pointer when handing the WaitGroup to other functions. You can read more here.
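
    To make that concrete, here is a minimal, self-contained sketch of the structure I mean. It is not your actual crawler: getContents, getPagination and the HTML parsing are replaced by stand-in stubs (nextPages, plain strings instead of *html.Node), so treat it as an outline under those assumptions rather than a drop-in fix. It drops channelDone and the break, lets a sync.WaitGroup track when every Crawl goroutine is done, and closes the body channel so main's receive loop ends on its own:

    package main

    import (
        "fmt"
        "sync"
    )

    // Spider tracks visited URLs; the map is guarded by a mutex because
    // Crawl goroutines check and update it concurrently.
    type Spider struct {
        mu      sync.Mutex
        visited map[string]bool
    }

    // markVisited reports whether url was already handled, marking it if not.
    func (s *Spider) markVisited(url string) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.visited[url] {
            return true
        }
        s.visited[url] = true
        return false
    }

    // Crawl signals completion through the WaitGroup instead of channelDone.
    func (s *Spider) Crawl(url string, wg *sync.WaitGroup, bodies chan<- string) {
        defer wg.Done()

        body := "contents of " + url // stand-in for getContents(url)
        bodies <- body

        for _, next := range nextPages(url) { // stand-in for getPagination(body)
            if !s.markVisited(next) {
                wg.Add(1) // register the new goroutine before starting it
                go s.Crawl(next, wg, bodies)
            }
        }
    }

    // nextPages is a stub pagination: page1 links to page2 and page3.
    func nextPages(url string) []string {
        if url == "page1" {
            return []string{"page2", "page3"}
        }
        return nil
    }

    func main() {
        bodies := make(chan string, 120)
        s := &Spider{visited: map[string]bool{"page1": true}}

        var wg sync.WaitGroup
        wg.Add(1)
        go s.Crawl("page1", &wg, bodies)

        // Close bodies once every Crawl goroutine has finished, so the
        // receive loop below ends by itself instead of break-ing early.
        go func() {
            wg.Wait()
            close(bodies)
        }()

        for body := range bodies {
            fmt.Println("handled", body)
        }
        fmt.Println("Jobs finished")
    }

    One more note on the sketch: it guards the visited set with a mutex, because in your original code several Crawl goroutines append to s.HandledUrls and call checkVisited concurrently, which is a data race.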

