douwen3836 2017-11-11 18:43

Trouble scraping GitLab with Go

I'm a newbie in programming and I need help. I'm trying to write a GitLab scraper in Go. Something goes wrong when I try to fetch information about projects concurrently.

Here is the code:

func (g *Gitlab) getAPIResponce(url string, structure interface{}) error {
    responce, responce_error := http.Get(url)
    if responce_error != nil {
        return responce_error
    }
    ret, _ := ioutil.ReadAll(responce.Body)
    if string(ret) != "[]" {
        err := json.Unmarshal(ret, structure)
        return err
    }
    return errors.New(error_emptypage)
}

...

func (g *Gitlab) GetProjects() {
    projects_chan := make(chan Project, g.LatestProjectID)
    var waitGroup sync.WaitGroup
    queue := make(chan struct{}, 50)
    for i := g.LatestProjectID; i > 0; i-- {
        url := g.BaseURL + projects_url + "/" + strconv.Itoa(i) + g.Token
        waitGroup.Add(1)
        go func(url string, channel chan Project) {
            queue <- struct{}{}
            defer waitGroup.Done()

            var oneProject Project
            err := g.getAPIResponce(url, &oneProject)
            if err != nil {
                fmt.Println(err.Error())
            }

            fmt.Printf(".")
            channel <- oneProject
            <-queue
        }(url, projects_chan)
    }

    go func() {
        waitGroup.Wait()
        close(projects_chan)
    }()

    for project := range projects_chan {
        if project.ID != 0 {
            g.Projects = append(g.Projects, project)
        }
    }
}

And here is the output:

$ ./gitlab-auditor 
latest project = 1532
Gathering projects...
.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Get https://gitlab.example.com/api/v4/projects/563&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/558&private_token=SeCrEt_ToKeN: unexpected EOF
..Get https://gitlab.example.com/api/v4/projects/531&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/571&private_token=SeCrEt_ToKeN: unexpected EOF
.Get https://gitlab.example.com/api/v4/projects/570&private_token=SeCrEt_ToKeN: unexpected EOF
..Get https://gitlab.example.com/api/v4/projects/467&private_token=SeCrEt_ToKeN: unexpected EOF
Get https://gitlab.example.com/api/v4/projects/573&private_token=SeCrEt_ToKeN: unexpected EOF
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

The failing projects are different on every run, but their IDs are always around 550.

When I curl the URLs from the output, I get normal JSON. When I run this code with queue := make(chan struct{}, 1) (effectively single-threaded), everything is fine.

What could be causing this?

1 answer

  • ds20021205 2017-12-07 22:59

    I would say this is not a very clear way to achieve concurrency. What seems to be happening here is:

    • You create a buffered channel with a capacity of 50.

    • Then you start 1532 goroutines at once.

    • The first 50 of them put a token on the queue and start processing. Each time one of them runs <-queue and frees a slot, a random goroutine from the remaining ones manages to get onto the queue.

    • As people say in the comments, you most likely hit some limit by the time the blast of requests has reached the IDs around 550. At that point GitLab's API gets angry at you and rate limits your requests.

    • Then another goroutine is started that closes the channel to notify the main goroutine once all the others are done.

    • The main goroutine reads the projects from the channel.
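
To check whether the server is really throttling you, it can help to make the fetch helper stricter. The sketch below is only an illustration, not the asker's code: fetchJSON, newStubServer and the fields of Project are hypothetical names, and the stub server exists only so the example runs without a real GitLab instance. It always closes the response body, reports read errors, and treats any non-200 status as an error. Whether GitLab answers with a status such as 429 or simply drops the connection (which is what "unexpected EOF" suggests) depends on the server setup.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Project is a hypothetical stand-in for the asker's Project type.
type Project struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// fetchJSON is a hypothetical helper: it GETs url, always closes the body,
// and returns an error for any non-200 status so throttling becomes visible.
func fetchJSON(url string, out interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	// Without this the connection is never released back to the pool.
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading response body: %w", err)
	}
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return json.Unmarshal(body, out)
}

// newStubServer starts a local server that always answers with the given
// status and body. It is demo scaffolding only, not part of the scraper.
func newStubServer(status int, body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		fmt.Fprint(w, body)
	}))
}

func main() {
	ok := newStubServer(http.StatusOK, `{"id": 563, "name": "demo"}`)
	defer ok.Close()
	limited := newStubServer(http.StatusTooManyRequests, "")
	defer limited.Close()

	var p Project
	fmt.Println(fetchJSON(ok.URL, &p), p.ID)
	fmt.Println(fetchJSON(limited.URL, &p))
}
```

If the errors turn into explicit status errors under load, the limit is on the API side; if they stay as EOFs, something between you and GitLab is probably closing connections.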

    The talk "Go Concurrency Patterns" as well as the blog post "Concurrency in Go" might help. Personally, I rarely use buffered channels. For your problem I would go like this:

    • Define a number of workers.

    • Have the main goroutine start the workers, each running a func that listens on a channel of ints, does the API call, and writes to a channel of projects.

    • Have the main goroutine send the project numbers to be fetched to the channel of ints and read from the channel of projects.

      • Maybe rate limit by starting a ticker and having main read from it before it sends the next request?

    • Main closes the number channel to notify the workers to exit.
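
The steps above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a drop-in replacement: fetchAll, the fetch callback and the one-field Project type are hypothetical, and the real API call would go where fakeFetch is. One deviation from the list: the IDs are sent from a small producer goroutine rather than from main itself, so that main can simply range over the results without needing a select.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Project is a hypothetical stand-in for the asker's Project type.
type Project struct{ ID int }

// fetchAll starts `workers` goroutines that read project IDs from a channel,
// call fetch for each ID, and send successful results to a results channel.
// If interval > 0, a ticker spaces out the IDs; interval 0 disables the limit.
func fetchAll(latestID, workers int, interval time.Duration, fetch func(id int) (Project, error)) []Project {
	ids := make(chan int)
	results := make(chan Project)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids { // loop ends once ids is closed
				p, err := fetch(id)
				if err != nil {
					fmt.Println(err)
					continue
				}
				results <- p
			}
		}()
	}

	// Producer: hand out IDs, optionally waiting for a tick before each one.
	go func() {
		var tick <-chan time.Time
		if interval > 0 {
			t := time.NewTicker(interval)
			defer t.Stop()
			tick = t.C
		}
		for id := latestID; id > 0; id-- {
			if tick != nil {
				<-tick
			}
			ids <- id
		}
		close(ids)     // tells the workers to exit
		wg.Wait()      // wait until every worker has finished
		close(results) // lets the range loop below terminate
	}()

	var projects []Project
	for p := range results {
		projects = append(projects, p)
	}
	return projects
}

func main() {
	// fakeFetch stands in for the real API call; IDs divisible by 10 "fail".
	fakeFetch := func(id int) (Project, error) {
		if id%10 == 0 {
			return Project{}, fmt.Errorf("project %d: simulated error", id)
		}
		return Project{ID: id}, nil
	}
	projects := fetchAll(100, 5, time.Millisecond, fakeFetch)
	fmt.Println("fetched", len(projects), "projects")
}
```

With this shape only `workers` requests are ever in flight, and only that many goroutines exist, instead of one goroutine per project. The ticker interval is a guess you would have to tune against whatever limit your GitLab instance enforces.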

    This answer was selected by the asker as the best answer.
