weixin_39913422
2020-12-02 08:26

What is the GPU memory size of your V100? (ERROR: Unexpected bus error encountered in worker)

Hi, I am trying to use a single V100 GPU with 16G of memory to run the fine-tuning for the COCO image captioning task, and I always encounter the error "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)". What is the GPU memory of your V100, and what configurations (e.g., batch_size, num_workers) would you recommend for running the COCO captioning fine-tuning on my cluster of 4 x V100 GPUs with 16G memory each? Thanks!

This question comes from the open-source project: LuoweiZhou/VLP

10 replies

  • weixin_39966376 · 5 months ago

    Sorry I missed your question. shm-size=8G should generally work well.
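
    For reference, here is a minimal sketch of passing the shared-memory size at container start-up; the image name and mount paths are placeholders, not the project's exact command (older Docker setups use `nvidia-docker run` instead of `--gpus all`):

    ```bash
    # Start the container with a larger shared-memory segment (Docker's default
    # /dev/shm is only 64 MB, far too small for multi-worker DataLoaders).
    # Image name and mount paths are placeholders; adapt them to your setup.
    docker run --gpus all --shm-size=8g \
        -v /path/to/VLP:/workspace/VLP \
        -v /path/to/data:/workspace/data \
        -it vlp-image:latest /bin/bash
    ```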

  • weixin_39966376 · 5 months ago

    -cuhk There is a "None gradient" issue with the current code when using --fp16. A temporary (hacky) solution is here: https://github.com/NVIDIA/apex/issues/131#issuecomment-458859777

  • weixin_39966376 · 5 months ago

    -cuhk We also use V100 GPUs with 16G of memory. Are you using the docker image we provided? It seems you need to increase the shm-size. You can leave batch_size and lr at their defaults. You might want to reduce num_workers if you keep getting the shm issue.
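
    A quick way to confirm whether shared memory is the limiting factor is to check /dev/shm inside the container (a generic diagnostic, not part of the VLP scripts):

    ```bash
    # Show the size and current usage of the shared-memory filesystem;
    # if it is only 64M, multi-worker DataLoaders will hit the bus error quickly.
    df -h /dev/shm
    ```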

  • weixin_39913422 · 5 months ago

    Hi, I do not use the docker image you provide. Instead, I built another docker image from your Dockerfile and pulled it onto Google Cloud Platform for use (it seems no shm_size parameter is involved in this process). After reducing num_workers to 0, the bus error disappears, but a CUDA memory error occurs (RuntimeError: CUDA error: out of memory). This error keeps happening even when I decrease the batch_size to 32. Do you have any suggestions? Also, I would like to ask whether you set fp16=True, and if yes, does it reduce the overall performance? Thanks!

  • weixin_39913422 · 5 months ago

    Oh, I have figured out this problem. When I closed the running command in the CLI with Ctrl-Z, the GPU memory was not released! (it may be caused by the caching processes in your program). After manually killing these hidden processes with kill -9 PID, the problem is solved. So my final question is: how much does fp16 affect the performance? :)
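
    In case others hit the same thing: Ctrl-Z only suspends the process, so its CUDA context (and GPU memory) stays allocated. An illustrative clean-up, with the PID as a placeholder:

    ```bash
    # Find the suspended/leftover processes that still hold GPU memory.
    nvidia-smi                    # the "Processes" table lists PIDs using each GPU
    kill -9 <PID>                 # force-kill a leftover process (replace <PID>)

    # Alternatively, resume the suspended job and stop it cleanly:
    jobs                          # list jobs suspended with Ctrl-Z in this shell
    fg %1                         # bring job 1 back to the foreground, then Ctrl-C
    ```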

  • weixin_39966376 · 5 months ago

    -cuhk Good to know! To avoid the "zombies", I personally recommend keeping the training commands in a bash file and running it in a tmux or screen session. To kill all the programs, make sure you kill the session directly.
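
    A sketch of that workflow, with an arbitrary script and session name:

    ```bash
    # Put the training command(s) in a script, e.g. run_coco_finetune.sh (any name),
    # then run it inside a named tmux session.
    tmux new -s vlp_train              # start a new session called "vlp_train"
    bash run_coco_finetune.sh          # launch training inside the session
    # Detach with Ctrl-b d; re-attach later with:
    tmux attach -t vlp_train
    # Killing the session terminates every process started inside it:
    tmux kill-session -t vlp_train
    ```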

    I ran a few preliminary experiments with fp16 a while back and observed 1-2% worse results on COCO (w/o pre-training). The option has not been updated since then. Let me know if you find anything new, and you are welcome to send a PR.

  • weixin_39913422 · 5 months ago

    Thanks for your quick response and recommendation!

  • weixin_39913422 · 5 months ago

    Hi, I ran the fine-tuning for COCO captioning on a cluster of 6 x V100 GPUs with 16G memory and found that each epoch takes about 80 min, which is much slower than the 12 min/epoch reported in the paper (even though I have fewer GPUs than the 8 x V100 in your setting). What are the possible reasons for this big drop in training speed? Also, I see that you add --amp in the fine-tuning command, but after inspecting run_img2txt_dist.py, I find that amp only takes effect when --fp16 is True. Am I correct? (I suspect my slower fine-tuning is due to the lack of mixed-precision training via apex.amp.)

  • weixin_39966376 · 5 months ago

    -cuhk Yes, as I mentioned earlier we did not use --fp16 in our experiments, so --amp is redundant and needs to be removed. We use 8x V100 GPUs with NVLink, and I suspect this is what causes the speed difference.
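
    If you want to check the interconnect on your own machines, this generic diagnostic prints the link matrix (not something specific to the VLP repo):

    ```bash
    # "NV1"/"NV2" entries indicate NVLink connections between GPU pairs;
    # "PHB"/"PXB"/"SYS" indicate PCIe or system-level paths, which are slower
    # for the gradient all-reduce in distributed training.
    nvidia-smi topo -m
    ```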

  • weixin_39946029 · 5 months ago

    Is there a rule of thumb for setting shm-size? How is it related to #GPUs used for training?
