duandu2980 2015-12-17 16:16

High-performance web spider without external dependencies

I'm trying to write my first web spider in Go. Its task is to crawl domains (and inspect their HTML) taken from a database query. The idea is to have no third-party dependencies (e.g. a message queue), or as few as possible, yet it has to be performant enough to crawl 5 million domains per day. I have approximately 150 million domains I need to check every month.

The very basic version is below. It runs in an "infinite loop", since in theory the crawl process would be endless.

package main

import (
    "runtime"
    "sync"
    "time"
)

func crawl(n time.Duration) {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU())

    for range time.Tick(n * time.Second) {
        wg.Add(1)

        go func() {
            defer wg.Done()

            // do the expensive work here - query db, crawl domain, inspect html
        }()
    }
    wg.Wait() // never reached: time.Tick loops forever
}

func main() {
    go crawl(1)

    select {}
}

Running this code on 4 CPU cores means it can perform at most 345,600 requests during 24 hours ((60 * 60 * 24) * 4) with the given threshold of 1s. At least that's my understanding :-) If my thinking is correct, I will need to come up with a solution that is roughly 14x faster to meet the daily requirement.

I would appreciate your advice on making the crawler faster, without resorting to a complicated stack setup or buying a server with more CPU cores.

1 answer

  • dsd119120 2015-12-17 17:19

    Why have the timing component at all?

    Just create a channel that you feed URLs to, then spawn N goroutines that loop over that channel and do the work.

Then just tweak the value of N until your CPU/memory is capped at ~90% utilization (to accommodate fluctuations in site response times).

Something like this (runnable on the Go Playground):

    package main

    import (
        "fmt"
        "sync"
    )

    var numWorkers = 10

    // crawler drains the urls channel until it is closed.
    func crawler(urls chan string, wg *sync.WaitGroup) {
        defer wg.Done()
        for u := range urls {
            fmt.Println(u) // do the real fetching/parsing here
        }
    }

    func main() {
        ch := make(chan string)
        var wg sync.WaitGroup
        for i := 0; i < numWorkers; i++ {
            wg.Add(1)
            go crawler(ch, &wg)
        }
        ch <- "http://ibm.com"
        ch <- "http://google.com"
        close(ch)
        wg.Wait()
        fmt.Println("All Done")
    }
    
