duansaoguan7955 2018-06-28 19:02

Limiting gocolly to a maximum number of URLs processed at a time

I am trying to use gocolly's Parallelism setting to cap the number of URLs being scraped at any one time.

Using the code I've pasted below, I am getting this output:

Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=sQuKLv

This shows that the visits are not being throttled to the maximum number of parallel requests I configured. When I add more URLs, they are all sent at once, which results in a ban from the server.

How can I configure the library to get the following output:

Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=sQuKLv

Here is the code:

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gocolly/colly"
)

const (
    letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    URL         = "https://www.google.com/search?q="
)

// RandStringBytes sends five random n-letter strings on the returned channel,
// then closes it.
func RandStringBytes(n int) chan string {
    out := make(chan string)

    go func() {
        for i := 1; i <= 5; i++ {
            b := make([]byte, n)
            for j := range b {
                b[j] = letterBytes[rand.Intn(len(letterBytes))]
            }
            out <- string(b)
        }
        close(out)
    }()
    return out
}

func main() {
    c := RandStringBytes(6)
    collector := colly.NewCollector(
        colly.AllowedDomains("www.google.com"),
        colly.Async(true),
        colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
    )

    collector.Limit(&colly.LimitRule{
        DomainRegexp: "www.google.com",
        Parallelism:  2,
        RandomDelay:  5 * time.Second,
    })
    collector.OnResponse(func(r *colly.Response) {
        url := r.Ctx.Get("url")
        fmt.Println("Done visiting", url)
    })
    collector.OnRequest(func(r *colly.Request) {
        r.Ctx.Put("url", r.URL.String())
        fmt.Println("Visiting", r.URL.String())
    })
    collector.OnError(func(r *colly.Response, err error) {
        fmt.Println(err)
    })

    for w := range c {
        collector.Visit(URL + w)
    }

    collector.Wait()
}


1 answer

  • douluogu8713 2018-07-20 15:09

    OnRequest runs before the request is actually sent to the server, so your debug statement is misleading: fmt.Println("Visiting", r.URL.String()) should probably be fmt.Println("Preparing request for:", r.URL.String()).
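
    A quick way to see this is to timestamp both callbacks; the "Done visiting" lines are the ones that reflect actual network activity. A minimal sketch (not part of the original code), dropping these handlers into the program above:

        collector.OnRequest(func(r *colly.Request) {
            // Fires when colly prepares the request, before anything hits the wire.
            fmt.Println(time.Now().Format("15:04:05.000"), "Preparing request for:", r.URL.String())
        })
        collector.OnResponse(func(r *colly.Response) {
            // Fires only after the server has answered.
            fmt.Println(time.Now().Format("15:04:05.000"), "Done visiting", r.Request.URL.String())
        })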

    I thought your question was interesting, so I set up a local test case with Python's http.server like so:

    $ cd $(mktemp -d) # make temp dir
    $ for n in {0..99}; do touch $n; done # make 100 empty files
    $ python3 -m http.server # start up test server
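
    (If you would rather keep everything in Go, a roughly equivalent throwaway file server is a few lines of net/http. This is just a sketch, serving the same temp directory on the address the test code below expects:)

        package main

        import (
            "log"
            "net/http"
        )

        func main() {
            // Serve the current working directory; run this from the temp dir
            // that holds the 100 empty files.
            log.Fatal(http.ListenAndServe("127.0.0.1:8000", http.FileServer(http.Dir("."))))
        }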
    

    Then modify your code above:

    package main
    
    import (
        "fmt"
        "strconv"
        "time"
    
        "github.com/gocolly/colly"
    )
    
    const URL = "http://127.0.0.1:8000/"
    
    func main() {
        collector := colly.NewCollector(
            colly.AllowedDomains("127.0.0.1:8000"),
            colly.Async(true),
            colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
        )
    
        collector.Limit(&colly.LimitRule{
            DomainRegexp: "127.0.0.1:8000",
            Parallelism:  2,
            Delay:        5 * time.Second,
        })
    
        collector.OnResponse(func(r *colly.Response) {
            url := r.Ctx.Get("url")
            fmt.Println("Done visiting", url)
        })
    
        collector.OnRequest(func(r *colly.Request) {
            r.Ctx.Put("url", r.URL.String())
            fmt.Println("Creating request for:", r.URL.String())
        })
    
        collector.OnError(func(r *colly.Response, err error) {
            fmt.Println(err)
        })
    
        for i := 0; i < 100; i++ {
            collector.Visit(URL + strconv.Itoa(i))
        }
    
        collector.Wait()
    }
    

    Note that I changed RandomDelay to a fixed Delay, which makes the timing easier to reason about in a test case, and I changed the debug statement in OnRequest.
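
    For reference, these are the LimitRule fields in play. As far as I understand colly's behavior, RandomDelay adds a random extra wait on top of Delay rather than replacing it; the values below are just the ones from this test:

        collector.Limit(&colly.LimitRule{
            DomainRegexp: "127.0.0.1:8000",
            Parallelism:  2,               // at most 2 requests in flight for this domain
            Delay:        5 * time.Second, // fixed wait between requests to this domain
            // RandomDelay: 5 * time.Second, // would add up to 5s of random extra wait
        })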

    Now if you go run this file, you'll see that:

    1. it immediately prints Creating request for: http://127.0.0.1:8000/ + a number, 100 times
    2. it prints Done visiting http://127.0.0.1:8000/ + a number, twice
    3. the Python HTTP server prints 2 GET requests, 1 for each of the numbers in #2
    4. it pauses 5 seconds
    5. steps #2 - #4 repeat for the remaining numbers

    So it looks to me like colly is behaving as intended. If you're still getting rate-limit errors that you don't expect, consider verifying that your limit rule actually matches the domain you're requesting.
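
    One easy check: Limit returns an error, so rather than discarding it you can fail fast if the rule could not be registered at all. A sketch (assuming "log" is imported; the message is mine):

        if err := collector.Limit(&colly.LimitRule{
            DomainRegexp: "www.google.com",
            Parallelism:  2,
            RandomDelay:  5 * time.Second,
        }); err != nil {
            // An error here typically means the rule has no usable domain pattern,
            // in which case no throttling is applied at all.
            log.Fatal("could not register limit rule: ", err)
        }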

    This answer was selected as the best answer by the asker.