dtmwnqng38644 2019-06-20 20:41
浏览 49

尝试使用Go例程时无法刮取数据

I am trying to scrape related words for a given word for which I am using BFS starting with the given word and searching through each related word on dictionary.com

I have tried this code without concurrency and it works just fine, but takes a lot of time hence, tried using go routines but my code gets stuck after the first iteration. The first level of BFS works just fine but then in the second level it hangs!

package main

import (
    "fmt"
    "github.com/gocolly/colly"
    "sync"
)

var wg sync.WaitGroup


func buildURL(word string) string {
    return "https://www.dictionary.com/browse/" + string(word)
}

func get(url string) []string {
    c := colly.NewCollector()
    c.IgnoreRobotsTxt = true
    var ret []string
    c.OnHTML("a.css-cilpq1.e15p0a5t2", func(e *colly.HTMLElement) {
        ret = append(ret, string(e.Text))
    })
    c.Visit(url)
    c.Wait()

    return ret
}

func threading(c chan []string, word string) {
    defer wg.Done()
    var words []string
    for _, w := range get(buildURL(word)) {
        words = append(words, w)
    }
    c <- words
}

func main() {
    fmt.Println("START")
    word := "jump"
    maxDepth := 2

    //bfs
    var q map[string]int
    nq := map[string]int {
        word: 0,
    }

    vis := make(map[string]bool)
    queue := make(chan []string, 5000)

    for i := 1; i <= maxDepth; i++ {
        fmt.Println(i)
        q, nq = nq, make(map[string]int)
        for word := range q {
            if _, ok := vis[word]; !ok {
                wg.Add(1)
                vis[word] = true
                go threading(queue, word)

                for v := range queue {
                    fmt.Println(v)
                    for _, w := range v {
                        nq[w] = i
                    }
                }
            }
        }
    }
    wg.Wait()
    close(queue)
    fmt.Println("END")
}

OUTPUT:

START
1
[plunge dive rise upsurge bounce hurdle fall vault drop advance upturn inflation increment spurt boost plummet skip bound surge take]

hangs just here forever, counter = 2 is not printed!

Can check here https://www.dictionary.com/browse/jump for the related words.

  • 写回答

1条回答 默认 最新

  • dongzhansong5785 2019-06-22 19:44
    关注

    According to Tour of Go

    Sends to a buffered channel block only when the buffer is full. Receives block when the buffer is empty.

    So, in this case, you are creating a buffered channel using 5000 as length.

    for i := 1; i <= maxDepth; i++ {
        fmt.Println(i)
        q, nq = nq, make(map[string]int)
        for word := range q { // for each word
            if _, ok := vis[word]; !ok { // if not visited visit
                wg.Add(1) // add a worker
                vis[word] = true
                go threading(queue, word) // fetch in concurrent manner
                for v := range queue { // <<< blocks here when queue is empty
                    fmt.Println(v)
                    for _, w := range v {
                        nq[w] = i
                    }
                }
            }
        }
    }
    

    As you can see I've commented in the code, after 1st iteration the for loop gonna block until channel is empty. In this case after fetching jump It sends the array corresponding similar words, but after that as the for loop is blocking as zerkems explains you will not get to next iteration(i = 2). You can ultimately close the channel to end the blocking in for loop. But since you use the same channel to write over multiple goroutines it will panic if you closed it from multiple goroutines.

    To overcome this we can come up with a nice workaround.

    1. We exactly know how much unvisited items we are fetching for.
    2. We now know where is the block

    First, we need to count the unvisited words and then we can iterate that much of the time

        vis := make(map[string]bool)
        queue := make(chan []string, 5000)
        for i := 1; i <= maxDepth; i++ {
            fmt.Println(i)
            q, nq = nq, make(map[string]int)
            unvisited := 0
            for word := range q {
                if _, ok := vis[word]; !ok {
                    vis[word] = true
                    unvisited++
                    wg.Add(1)
                    go threading(queue, word)
                }
            }
    
            wg.Wait() // wait until jobs are done
            for j := 0; j < unvisited; j++ { // << does not block as we know how much
                v := <-queue // we exactly try to get unvisited amount
                fmt.Println(v)
                for _, w := range v {
                    nq[w] = i
                }
            }
        }
    

    In this situation, we are simply counting what is the minimum iterations we need to go to get results. Also, you can see that I've moved down the for loop outer and use original one to just add words to workers. It will ask to fetch all words and will wait in the following loop to complete there tasks in a non-blocking way.

    Latter loop waits until all workers are done. After that next iteration, works and next level of BFS can be reached.

    Summary

    1. Distribute workload
    2. Wait for results
    3. Don't do both at the same time

    Hope this helps.

    评论

报告相同问题?

悬赏问题

  • ¥15 Vue3 大型图片数据拖动排序
  • ¥15 划分vlan后不通了
  • ¥15 GDI处理通道视频时总是带有白色锯齿
  • ¥20 用雷电模拟器安装百达屋apk一直闪退
  • ¥15 算能科技20240506咨询(拒绝大模型回答)
  • ¥15 自适应 AR 模型 参数估计Matlab程序
  • ¥100 角动量包络面如何用MATLAB绘制
  • ¥15 merge函数占用内存过大
  • ¥15 使用EMD去噪处理RML2016数据集时候的原理
  • ¥15 神经网络预测均方误差很小 但是图像上看着差别太大