duansaoguan7955 2018-06-29 03:02
Accepted

Limit gocolly to processing a fixed number of URLs at a time

I am trying to use gocolly's Parallelism setting to cap the number of URLs being scraped at the same time.

Using the code I've pasted below, I am getting this output:

Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=sQuKLv

This shows that the visits do not block at the given maximum number of threads. When I add more URLs, they are all sent at once, which gets me banned by the server.

How can I configure the library to get the following output:

Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=sQuKLv

Here is the code:

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gocolly/colly"
)

const (
    letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    URL = "https://www.google.com/search?q="
)

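// RandStringBytes returns a channel that yields five random strings of length n
// and is closed once all of them have been sent.
func RandStringBytes(n int) chan string {
    out := make(chan string)

    go func() {
        // Close the channel when done so the range loop in main terminates.
        defer close(out)
        for i := 1; i <= 5; i++ {
            b := make([]byte, n)
            for j := range b {
                b[j] = letterBytes[rand.Intn(len(letterBytes))]
            }
            out <- string(b)
        }
    }()
    return out
}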

func main() {
    c := RandStringBytes(6) 
    collector := colly.NewCollector(
        colly.AllowedDomains("www.google.com"),
        colly.Async(true),
        colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
    )   

    collector.Limit(&colly.LimitRule{
        DomainRegexp: "www.google.com",
        Parallelism: 2,
        RandomDelay: 5 * time.Second,
    })
    collector.OnResponse(func(r *colly.Response) {
        url := r.Ctx.Get("url")
        fmt.Println("Done visiting", url)
    })
    collector.OnRequest(func(r *colly.Request) {
        r.Ctx.Put("url", r.URL.String())
        fmt.Println("Visiting", r.URL.String())
    })
    collector.OnError(func(r *colly.Response, err error) {
        fmt.Println(err)
    })

    for w := range c {
        collector.Visit(URL+w)
    }

    collector.Wait()
}



1 answer

  • douluogu8713 2018-07-20 23:09

    OnRequest runs before the request is actually sent to the server, so your debug statement is misleading: fmt.Println("Visiting", r.URL.String()) should probably be fmt.Println("Creating request for:", r.URL.String()).

    I thought your question was interesting, so I set up a local test case with Python's http.server like so:

    $ cd $(mktemp -d) # make temp dir
    $ for n in {0..99}; do touch $n; done # make 100 empty files
    $ python3 -m http.server # start up test server
    

    Then modify your code above:

    package main
    
    import (
        "fmt"
        "strconv"
        "time"
    
        "github.com/gocolly/colly"
    )
    
    const URL = "http://127.0.0.1:8000/"
    
    func main() {
        collector := colly.NewCollector(
            colly.AllowedDomains("127.0.0.1:8000"),
            colly.Async(true),
            colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
        )
    
        collector.Limit(&colly.LimitRule{
            DomainRegexp: "127.0.0.1:8000",
            Parallelism:  2,
            Delay:        5 * time.Second,
        })
    
        collector.OnResponse(func(r *colly.Response) {
            url := r.Ctx.Get("url")
            fmt.Println("Done visiting", url)
        })
    
        collector.OnRequest(func(r *colly.Request) {
            r.Ctx.Put("url", r.URL.String())
            fmt.Println("Creating request for:", r.URL.String())
        })
    
        collector.OnError(func(r *colly.Response, err error) {
            fmt.Println(err)
        })
    
        for i := 0; i < 100; i++ {
            collector.Visit(URL + strconv.Itoa(i))
        }
    
        collector.Wait()
    }
    

    Note that I changed the RandomDelay to a regular one, which makes things easier to reason about for a test case, and I changed the debug statement for OnRequest.

    Now if you run this file with go run, you'll see that:

    1. it immediately prints Creating request for: http://127.0.0.1:8000/ + a number, 100 times
    2. it prints Done visiting http://127.0.0.1:8000/ + a number, twice
    3. the Python HTTP server prints 2 GET requests, 1 for each of the numbers in #2
    4. it pauses 5 seconds
    5. steps #2 - #4 repeat for the remaining numbers
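
    The pattern in the steps above can be reproduced with a small standard-library model. This is only a sketch of the general technique (a buffered channel used as a counting semaphore), not colly's actual code. The function name runModel, the integer request IDs and the short delay are all made up for illustration. The "Creating request for:" line stands in for OnRequest and prints for every request straight away, while the simulated send is gated by the semaphore and the delay is spent while the slot is still held.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runModel simulates n queued requests gated by a semaphore with
// `parallelism` slots. It returns the highest number of simulated
// requests that held a slot at the same time.
// Hypothetical helper for illustration; not part of colly.
func runModel(n, parallelism int, delay time.Duration) int32 {
	slots := make(chan struct{}, parallelism) // counting semaphore
	var inFlight, peak int32
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()

			// Stand-in for OnRequest: runs immediately for every request,
			// before any slot has been acquired.
			fmt.Println("Creating request for:", id)

			slots <- struct{}{} // blocks until one of the slots is free
			cur := atomic.AddInt32(&inFlight, 1)

			// Record the peak number of concurrent slot holders.
			for {
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}

			// Stand-in for OnResponse.
			fmt.Println("Done visiting", id)

			time.Sleep(delay) // the delay is spent while still holding the slot
			atomic.AddInt32(&inFlight, -1)
			<-slots // release the slot
		}(i)
	}

	wg.Wait()
	return peak
}

func main() {
	peak := runModel(6, 2, 50*time.Millisecond)
	fmt.Println("peak in flight:", peak)
}
```

    In this model the "Creating request for:" lines are not held back by the limit, while the "Done visiting" lines arrive at most two per delay window, which mirrors the output described above.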

    So it looks to me like colly is behaving as intended. If you're still getting rate limit errors that you don't expect, verify that your limit rule actually matches the domain.
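
    One rough way to check a DomainRegexp value outside of colly is to compile it with Go's regexp package and test it against the host you expect. This sketch assumes the pattern is treated as an ordinary Go regular expression; whether the string colly matches against includes the port may depend on the colly version, so treat the hosts below as assumptions. matchesRule is a hypothetical helper, not a colly function.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesRule reports whether pattern, compiled as a Go regular
// expression, matches host. Hypothetical helper for illustration.
func matchesRule(pattern, host string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(host), nil
}

func main() {
	checks := []struct{ pattern, host string }{
		{"www.google.com", "www.google.com"},
		{"127.0.0.1:8000", "127.0.0.1:8000"},
		// Anchored pattern without the port: does not match a host that carries one.
		{`^127\.0\.0\.1$`, "127.0.0.1:8000"},
	}

	for _, c := range checks {
		ok, err := matchesRule(c.pattern, c.host)
		if err != nil {
			fmt.Println("bad pattern:", c.pattern, err)
			continue
		}
		fmt.Printf("pattern %q vs host %q: %v\n", c.pattern, c.host, ok)
	}
}
```

    Note that an unescaped dot matches any character, so a pattern such as "www.google.com" is looser than it looks. If no rule matches, no limit applies to that request.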

    This answer was selected as the best answer by the asker.