duansaoguan7955 2018-06-28 19:02

Limiting gocolly to a maximum number of URLs processed at a time

I am trying to use gocolly's Parallelism setting to cap the number of URLs being scraped at any one time.

Using the code I've pasted below, I am getting this output:

Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=sQuKLv

This shows that the visits are not being throttled to the maximum number of parallel requests I configured. When I add more URLs, they are all sent at once, which results in a ban from the server.

How can I configure the library to get the following output:

Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=sQuKLv

Here is the code:

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gocolly/colly"
)

const (
    letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    URL         = "https://www.google.com/search?q="
)

// RandStringBytes sends five random n-letter strings on the returned channel,
// then closes it.
func RandStringBytes(n int) chan string {
    out := make(chan string)

    go func() {
        for i := 1; i <= 5; i++ {
            b := make([]byte, n)
            for j := range b {
                b[j] = letterBytes[rand.Intn(len(letterBytes))]
            }
            out <- string(b)
        }
        close(out)
    }()
    return out
}

func main() {
    c := RandStringBytes(6)
    collector := colly.NewCollector(
        colly.AllowedDomains("www.google.com"),
        colly.Async(true),
        colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
    )

    collector.Limit(&colly.LimitRule{
        DomainRegexp: "www.google.com",
        Parallelism:  2,
        RandomDelay:  5 * time.Second,
    })
    collector.OnResponse(func(r *colly.Response) {
        url := r.Ctx.Get("url")
        fmt.Println("Done visiting", url)
    })
    collector.OnRequest(func(r *colly.Request) {
        r.Ctx.Put("url", r.URL.String())
        fmt.Println("Visiting", r.URL.String())
    })
    collector.OnError(func(r *colly.Response, err error) {
        fmt.Println(err)
    })

    for w := range c {
        collector.Visit(URL + w)
    }

    collector.Wait()
}


1 answer

  • douluogu8713 2018-07-20 15:09

    OnRequest runs before the request is actually sent to the server, so your debug statement is misleading: fmt.Println("Visiting", r.URL.String()) should probably be fmt.Println("Preparing request for:", r.URL.String()).
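
    A quick way to see this is to timestamp both callbacks; the "Done visiting" lines are the ones that reflect actual network activity. A minimal sketch (not part of the original code), dropping these handlers into the program above:

        collector.OnRequest(func(r *colly.Request) {
            // Fires when colly prepares the request, before anything hits the wire.
            fmt.Println(time.Now().Format("15:04:05.000"), "Preparing request for:", r.URL.String())
        })
        collector.OnResponse(func(r *colly.Response) {
            // Fires only after the server has answered.
            fmt.Println(time.Now().Format("15:04:05.000"), "Done visiting", r.Request.URL.String())
        })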

    I thought your question was interesting, so I set up a local test case with Python's http.server like so:

    $ cd $(mktemp -d) # make temp dir
    $ for n in {0..99}; do touch $n; done # make 100 empty files
    $ python3 -m http.server # start up test server
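
    (If you would rather keep everything in Go, a roughly equivalent throwaway file server is a few lines of net/http. This is just a sketch, serving the same temp directory on the address the test code below expects:)

        package main

        import (
            "log"
            "net/http"
        )

        func main() {
            // Serve the current working directory; run this from the temp dir
            // that holds the 100 empty files.
            log.Fatal(http.ListenAndServe("127.0.0.1:8000", http.FileServer(http.Dir("."))))
        }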
    

    Then modify your code above:

    package main
    
    import (
        "fmt"
        "strconv"
        "time"
    
        "github.com/gocolly/colly"
    )
    
    const URL = "http://127.0.0.1:8000/"
    
    func main() {
        collector := colly.NewCollector(
            colly.AllowedDomains("127.0.0.1:8000"),
            colly.Async(true),
            colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
        )
    
        collector.Limit(&colly.LimitRule{
            DomainRegexp: "127.0.0.1:8000",
            Parallelism:  2,
            Delay:        5 * time.Second,
        })
    
        collector.OnResponse(func(r *colly.Response) {
            url := r.Ctx.Get("url")
            fmt.Println("Done visiting", url)
        })
    
        collector.OnRequest(func(r *colly.Request) {
            r.Ctx.Put("url", r.URL.String())
            fmt.Println("Creating request for:", r.URL.String())
        })
    
        collector.OnError(func(r *colly.Response, err error) {
            fmt.Println(err)
        })
    
        for i := 0; i < 100; i++ {
            collector.Visit(URL + strconv.Itoa(i))
        }
    
        collector.Wait()
    }
    

    Note that I changed RandomDelay to a fixed Delay, which makes the timing easier to reason about in a test case, and I changed the debug statement in OnRequest.
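
    For reference, these are the LimitRule fields in play. As far as I understand colly's behavior, RandomDelay adds a random extra wait on top of Delay rather than replacing it; the values below are just the ones from this test:

        collector.Limit(&colly.LimitRule{
            DomainRegexp: "127.0.0.1:8000",
            Parallelism:  2,               // at most 2 requests in flight for this domain
            Delay:        5 * time.Second, // fixed wait between requests to this domain
            // RandomDelay: 5 * time.Second, // would add up to 5s of random extra wait
        })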

    Now if you go run this file, you'll see that:

    1. it immediately prints Creating request for: http://127.0.0.1:8000/ + a number, 100 times
    2. it prints Done visiting http://127.0.0.1:8000/ + a number, twice
    3. the Python HTTP server prints 2 GET requests, 1 for each of the numbers in #2
    4. it pauses 5 seconds
    5. steps #2 - #4 repeat for the remaining numbers

    So it looks to me like colly is behaving as intended. If you're still getting rate-limit errors that you don't expect, consider verifying that your limit rule actually matches the domain you're requesting.
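
    One easy check: Limit returns an error, so rather than discarding it you can fail fast if the rule could not be registered at all. A sketch (assuming "log" is imported; the message is mine):

        if err := collector.Limit(&colly.LimitRule{
            DomainRegexp: "www.google.com",
            Parallelism:  2,
            RandomDelay:  5 * time.Second,
        }); err != nil {
            // An error here typically means the rule has no usable domain pattern,
            // in which case no throttling is applied at all.
            log.Fatal("could not register limit rule: ", err)
        }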

    This answer was selected as the best answer by the asker.