沐衣橙风 2021-10-22 16:46 Acceptance rate: 0%
Views: 2433
Closed

Training yolox_l fails after a few dozen epochs with: CUDA error: an illegal memory access was encountered

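The first few dozen epochs run without any problems and training proceeds normally. I have tried many times; sometimes it stops after forty-odd epochs, and I don't know why.
This is on a server, and the problem occurs with both single-GPU and multi-GPU training.
Training on my own computer does not have this problem.
The error output is as follows: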


2021-10-22 16:07:42 | INFO     | yolox.core.trainer:318 - Save weights to ./YOLOX_outputs/yolox_l
2021-10-22 16:07:43 | INFO     | yolox.core.trainer:188 - ---> start train epoch75
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f564e880a22 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10983 (0x7f564eae1983 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f564eae3027 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f564e86a5a4 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x7f56a53d4e1a in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa27eb1 (0x7f56a53d4eb1 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1a6b5a (0x55f420004b5a in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #7: <unknown function> + 0x110cbc (0x55f41ff6ecbc in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #8: <unknown function> + 0x1105b9 (0x55f41ff6e5b9 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #9: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #10: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #11: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #12: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #13: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x65b0 (0x55f42003a160 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #15: _PyEval_EvalCodeWithName + 0xd52 (0x55f42002af72 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #16: _PyFunction_Vectorcall + 0x594 (0x55f42002ba44 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #17: PyObject_Call + 0x7d (0x55f41ff9587d in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x1f0e (0x55f420035abe in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #20: _PyFunction_Vectorcall + 0x534 (0x55f42002b9e4 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #21: PyObject_Call + 0x7d (0x55f41ff9587d in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x1f0e (0x55f420035abe in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #23: _PyFunction_Vectorcall + 0x1b7 (0x55f42002b667 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #24: PyObject_Call + 0x7d (0x55f41ff9587d in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1f0e (0x55f420035abe in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #26: _PyFunction_Vectorcall + 0x1b7 (0x55f42002b667 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x4c0 (0x55f420034070 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #29: _PyFunction_Vectorcall + 0x534 (0x55f42002b9e4 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x4c0 (0x55f420034070 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #31: _PyFunction_Vectorcall + 0x1b7 (0x55f42002b667 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x71b (0x55f4200342cb in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #34: _PyFunction_Vectorcall + 0x594 (0x55f42002ba44 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x15a9 (0x55f420035159 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #36: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #37: PyEval_EvalCode + 0x23 (0x55f42002bd33 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #38: <unknown function> + 0x2414a2 (0x55f42009f4a2 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #39: <unknown function> + 0x252292 (0x55f4200b0292 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #40: PyRun_StringFlags + 0x7a (0x55f4200b2eca in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #41: PyRun_SimpleStringFlags + 0x3c (0x55f4200b2f2c in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #42: Py_RunMain + 0x15b (0x55f4200b389b in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #43: Py_BytesMain + 0x39 (0x55f4200b3ce9 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #44: __libc_start_main + 0xe7 (0x7f56a829ebf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #45: <unknown function> + 0x1f7847 (0x55f420055847 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)

Traceback (most recent call last):
  File "tools/train.py", line 127, in <module>
    launch(
  File "/home/vision2021_meas/mycfhs/yolox/yolox/core/launch.py", line 82, in launch
    mp.start_processes(
  File "/home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 5 terminated with signal SIGABRT
/home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 149 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
3 answers

  • 爱晚乏客游 2021-10-22 17:07

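    Are you using the official code? If so, try adding the -o flag.
    Does this happen with multi-GPU training, or in some other setup? It could also be insufficient GPU memory: try reducing the batch size or the input image size.
    Another option is to use the resume flag to continue training from where it last stopped. If none of that works, the only thing left is to raise the issue on the official project page with a detailed description of your environment, parameter settings, and so on, and see whether the authors have a solution.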
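
The suggestions above (the -o flag, a smaller batch size, resuming from a checkpoint), together with the `CUDA_LAUNCH_BLOCKING=1` hint printed in the error log, can be sketched as a small command builder. This is only an illustration, not the project's recommended invocation: the helper names `build_train_command` and `debug_env` are hypothetical, and the experiment file path, batch size, device count, and checkpoint path are assumed values that need to be adapted to the actual setup. The flags `-f`, `-d`, `-b`, `-o`, `--resume`, and `-c` are the ones accepted by YOLOX's `tools/train.py`.

```python
import os
import shlex


def build_train_command(exp_file, devices, batch_size,
                        occupy=True, resume=False, ckpt=None):
    """Build an argv list for YOLOX's tools/train.py (hypothetical helper)."""
    cmd = ["python", "tools/train.py",
           "-f", exp_file,          # experiment description file
           "-d", str(devices),      # number of GPUs
           "-b", str(batch_size)]   # total batch size; lower it if memory is tight
    if occupy:
        cmd.append("-o")            # the -o flag suggested in the answer
    if resume:
        cmd.append("--resume")      # continue from a previous run
    if ckpt is not None:
        cmd += ["-c", ckpt]         # checkpoint to resume from
    return cmd


def debug_env(base_env=None):
    """Copy an environment and enable synchronous CUDA kernel launches,
    so the reported stack trace points closer to the failing call."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    # All values below are assumptions for illustration only.
    cmd = build_train_command(
        "exps/default/yolox_l.py",
        devices=1,
        batch_size=8,
        resume=True,
        ckpt="YOLOX_outputs/yolox_l/latest_ckpt.pth",
    )
    print(shlex.join(cmd))
    # To actually launch it, something like the following could be used:
    # import subprocess; subprocess.run(cmd, env=debug_env(), check=True)
```

Running with `CUDA_LAUNCH_BLOCKING=1` slows training down, so it is probably best kept for a debugging run rather than left on permanently.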


Question events

  • Closed on October 26
  • Question edited on October 22
  • Question created on October 22