dpd7195 2019-07-22 21:55

How to achieve the performance of the AWS CLI sync command with the AWS SDK

The aws s3 sync command in the AWS CLI can download a large collection of files very quickly, and I cannot achieve the same performance with the AWS SDK for Go. I have millions of files in the bucket, so this is critical to me. I also need to list the objects page by page so that I can filter by prefix, which the sync CLI command does not support well.

I have tried using multiple goroutines (from 10 up to 1000) to make requests to the server, but it is still much slower than the CLI. Each call to the Go GetObject function takes about 100 ms per file, which is unacceptable for the number of files I have. I know that the AWS CLI also uses the Python SDK in the backend, so how does it achieve so much better performance? (I tried my script in boto as well as in Go.)

I am using ListObjectsV2Pages and GetObject. My region is the same as the S3 server's.

    // s3c (*s3.S3), bucket (*string), err, and check(err) are assumed to be
    // defined earlier in the script.
    logMtx := &sync.Mutex{}
    logBuf := bytes.NewBuffer(make([]byte, 0, 100000000))

    err = s3c.ListObjectsV2Pages(
        &s3.ListObjectsV2Input{
            Bucket:  bucket,
            Prefix:  aws.String("2019-07-21-01"),
            MaxKeys: aws.Int64(1000),
        },
        func(page *s3.ListObjectsV2Output, lastPage bool) bool {
            fmt.Println("Received", len(page.Contents), "objects in page")
            // Buffered channel used as a semaphore: at most 10 downloads in flight.
            worker := make(chan bool, 10)
            for i := 0; i < cap(worker); i++ {
                worker <- true
            }
            wg := &sync.WaitGroup{}
            wg.Add(len(page.Contents))
            // Only this goroutine walks the page, so the loop needs no mutex.
            for _, obj := range page.Contents {
                <-worker // wait for a free slot
                go func(obj *s3.Object) {
                    gs := time.Now()
                    resp, err := s3c.GetObject(&s3.GetObjectInput{
                        Bucket: bucket,
                        Key:    obj.Key,
                    })
                    check(err)
                    fmt.Println("Get: ", time.Since(gs))

                    // logMtx guards the shared buffer; holding it during the
                    // read also serializes body reads across goroutines.
                    rs := time.Now()
                    logMtx.Lock()
                    _, err = logBuf.ReadFrom(resp.Body)
                    check(err)
                    logMtx.Unlock()
                    fmt.Println("Read: ", time.Since(rs))

                    err = resp.Body.Close()
                    check(err)
                    worker <- true // release the slot
                    wg.Done()
                }(obj)
            }
            fmt.Println("ok")
            // Wait for every object in this page before requesting the next page.
            wg.Wait()
            return true
        },
    )
    check(err)

Many results look like:

Get:  153.380727ms
Read:  51.562µs

  • doutonghang2761 2019-07-31 17:57

    I ended up settling for the script in my initial post. I tried 20 goroutines and that seemed to work pretty well. On my laptop (i7 with 8 threads, 16 GB RAM, NVMe), the script is definitely slower than the CLI. However, on the EC2 instance, the difference was small enough that it was not worth my time to optimize it further. I used a c5.xlarge instance in the same region as the S3 bucket.

    Selected by the asker as the best answer.