dtx9931 2017-10-31 06:50
84 views

How can I share the HTTP and AWS sessions across fetch-and-store-to-S3 operations?

I need to fetch the contents of several URLs and store them in AWS S3. I've written a function that does this and it works. But I am looking to make it faster and more efficient by re-using the HTTP client connection and the AWS session. Furthermore, I'd like the fetches to run concurrently, say 5 at a time.

import (
    "bytes"
    "io/ioutil"
    "log"
    "net/http"
    "net/url"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func fetchPut(fromURL string, toS3 string) error {
    // Fetch the source URL and time the download.
    start := time.Now()
    resp, err := http.Get(fromURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // A fresh session and service client are created on every call;
    // this is the part I would like to re-use.
    sess := session.Must(session.NewSession())
    s3svc := s3.New(sess)

    s3URL, err := url.Parse(toS3)
    if err != nil {
        return err
    }

    byteArray, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return err
    }
    fetchElapsed := time.Since(start).Seconds()
    log.Printf("fetch took %.2fs", fetchElapsed)

    // Time the S3 put; bucket and key come from the s3:// URL.
    start = time.Now()
    input := &s3.PutObjectInput{
        Body:   bytes.NewReader(byteArray),
        Bucket: aws.String(s3URL.Host),
        Key:    aws.String(s3URL.Path),
    }
    _, err = s3svc.PutObject(input)
    putElapsed := time.Since(start).Seconds()
    log.Printf("put took %.2fs", putElapsed)

    return err
}

What I don't understand is how I can re-use the session (both HTTP and AWS). Can I keep it in a global variable, or do I have to create some sort of context?

Are there any good examples of this sort of use case to study?


1 answer

  • dongyou1926 2017-10-31 13:32

    Your problem seems to be pretty general.

    As a principle, you need to separate the things that don't change (the session and AWS service object, and the fixed part of the destination, like the bucket name) from the things that do change (the source URL and the varying part of the destination, like the key name). Set up the unchanging configuration once, then run the URL fetch + S3 store concurrently, passing that configuration in as an additional argument.

    That boils down to moving the s3svc creation out of the fetchPut function and passing it in as an argument, then running fetchPut in goroutines, possibly with a sync.WaitGroup if you want to wait for all of them to finish; see the sketch below.
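    Here is a minimal sketch of that refactor, assuming fetchPut is changed to take the shared service client as an argument. The job list, bucket names, and the 5-slot semaphore limit are placeholders; the SDK's session and service clients are safe for concurrent use, so all goroutines can share them.

    package main

    import (
        "bytes"
        "io/ioutil"
        "log"
        "net/http"
        "net/url"
        "sync"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    // fetchPut now receives the shared service client instead of creating one.
    func fetchPut(s3svc *s3.S3, fromURL, toS3 string) error {
        resp, err := http.Get(fromURL) // the default client pools connections
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            return err
        }

        s3URL, err := url.Parse(toS3)
        if err != nil {
            return err
        }
        _, err = s3svc.PutObject(&s3.PutObjectInput{
            Body:   bytes.NewReader(body),
            Bucket: aws.String(s3URL.Host),
            Key:    aws.String(s3URL.Path),
        })
        return err
    }

    func main() {
        // Create the session and service client once; goroutines share them.
        sess := session.Must(session.NewSession())
        s3svc := s3.New(sess)

        jobs := map[string]string{ // fromURL -> toS3 (placeholder values)
            "https://example.com/a": "s3://my-bucket/a",
            "https://example.com/b": "s3://my-bucket/b",
        }

        var wg sync.WaitGroup
        sem := make(chan struct{}, 5) // allow at most 5 fetches at a time

        for from, to := range jobs {
            wg.Add(1)
            go func(from, to string) {
                defer wg.Done()
                sem <- struct{}{}        // acquire a slot
                defer func() { <-sem }() // release it
                if err := fetchPut(s3svc, from, to); err != nil {
                    log.Printf("%s: %v", from, err)
                }
            }(from, to)
        }
        wg.Wait()
    }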

    Another variation would be to run two pools of workers: producers (fetching URLs) and consumers (putting to S3), with a channel through which one feeds the other; a sketch follows. That would probably give the most speedup.
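    A rough sketch of that pipeline is below. The job and fetched types, the pool sizes, and the inputs are purely illustrative; the important part is the shutdown order at the end.

    package main

    import (
        "bytes"
        "io/ioutil"
        "log"
        "net/http"
        "net/url"
        "sync"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    // job and fetched are illustrative types for the two pipeline stages.
    type job struct{ fromURL, toS3 string }
    type fetched struct {
        toS3 string
        body []byte
    }

    func main() {
        sess := session.Must(session.NewSession())
        s3svc := s3.New(sess)

        jobs := make(chan job)
        results := make(chan fetched)

        // Producer pool: fetch URLs; the default http client pools connections.
        var producers sync.WaitGroup
        for i := 0; i < 5; i++ {
            producers.Add(1)
            go func() {
                defer producers.Done()
                for j := range jobs {
                    resp, err := http.Get(j.fromURL)
                    if err != nil {
                        log.Printf("fetch %s: %v", j.fromURL, err)
                        continue
                    }
                    body, err := ioutil.ReadAll(resp.Body)
                    resp.Body.Close()
                    if err != nil {
                        log.Printf("read %s: %v", j.fromURL, err)
                        continue
                    }
                    results <- fetched{toS3: j.toS3, body: body}
                }
            }()
        }

        // Consumer pool: put fetched bodies to S3 with the shared client.
        var consumers sync.WaitGroup
        for i := 0; i < 5; i++ {
            consumers.Add(1)
            go func() {
                defer consumers.Done()
                for f := range results {
                    u, err := url.Parse(f.toS3)
                    if err != nil {
                        log.Printf("parse %s: %v", f.toS3, err)
                        continue
                    }
                    if _, err := s3svc.PutObject(&s3.PutObjectInput{
                        Body:   bytes.NewReader(f.body),
                        Bucket: aws.String(u.Host),
                        Key:    aws.String(u.Path),
                    }); err != nil {
                        log.Printf("put %s: %v", f.toS3, err)
                    }
                }
            }()
        }

        // Feed the pipeline, then shut the pools down in order.
        jobs <- job{"https://example.com/a", "s3://my-bucket/a"} // placeholder
        close(jobs)
        producers.Wait()
        close(results) // safe only after all producers have stopped sending
        consumers.Wait()
    }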

    In general, I agree with your idea of making it concurrent - it's a good mind-stretching exercise and doesn't have to be considered premature optimization. I also can't resist advertising Rob Pike's excellent talk "Concurrency Is Not Parallelism". Rob's example of a load balancer is more complicated than your case, but it still gives a good overview of how to process requests concurrently.

    Btw, "session" used for http fetch is kind of transparent; as the commenters already mentioned, http client from standard library will be reused and you don't have to worry about that.

