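The first few dozen epochs are fine and train normally. I've tried many times, and sometimes it stops after only forty-something epochs. I don't know why.
This happens when training on a server, with both single-GPU and multi-GPU runs.
When I train on my own computer, this doesn't happen.
The error output is as follows: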
2021-10-22 16:07:42 | INFO | yolox.core.trainer:318 - Save weights to ./YOLOX_outputs/yolox_l
2021-10-22 16:07:43 | INFO | yolox.core.trainer:188 - ---> start train epoch75
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f564e880a22 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10983 (0x7f564eae1983 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f564eae3027 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f564e86a5a4 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x7f56a53d4e1a in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa27eb1 (0x7f56a53d4eb1 in /home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1a6b5a (0x55f420004b5a in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #7: <unknown function> + 0x110cbc (0x55f41ff6ecbc in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #8: <unknown function> + 0x1105b9 (0x55f41ff6e5b9 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #9: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #10: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #11: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #12: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #13: <unknown function> + 0x1105a3 (0x55f41ff6e5a3 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x65b0 (0x55f42003a160 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #15: _PyEval_EvalCodeWithName + 0xd52 (0x55f42002af72 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #16: _PyFunction_Vectorcall + 0x594 (0x55f42002ba44 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #17: PyObject_Call + 0x7d (0x55f41ff9587d in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x1f0e (0x55f420035abe in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #20: _PyFunction_Vectorcall + 0x534 (0x55f42002b9e4 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #21: PyObject_Call + 0x7d (0x55f41ff9587d in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x1f0e (0x55f420035abe in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #23: _PyFunction_Vectorcall + 0x1b7 (0x55f42002b667 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #24: PyObject_Call + 0x7d (0x55f41ff9587d in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1f0e (0x55f420035abe in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #26: _PyFunction_Vectorcall + 0x1b7 (0x55f42002b667 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x4c0 (0x55f420034070 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #29: _PyFunction_Vectorcall + 0x534 (0x55f42002b9e4 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x4c0 (0x55f420034070 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #31: _PyFunction_Vectorcall + 0x1b7 (0x55f42002b667 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x71b (0x55f4200342cb in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #34: _PyFunction_Vectorcall + 0x594 (0x55f42002ba44 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x15a9 (0x55f420035159 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #36: _PyEval_EvalCodeWithName + 0x260 (0x55f42002a480 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #37: PyEval_EvalCode + 0x23 (0x55f42002bd33 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #38: <unknown function> + 0x2414a2 (0x55f42009f4a2 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #39: <unknown function> + 0x252292 (0x55f4200b0292 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #40: PyRun_StringFlags + 0x7a (0x55f4200b2eca in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #41: PyRun_SimpleStringFlags + 0x3c (0x55f4200b2f2c in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #42: Py_RunMain + 0x15b (0x55f4200b389b in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #43: Py_BytesMain + 0x39 (0x55f4200b3ce9 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
frame #44: __libc_start_main + 0xe7 (0x7f56a829ebf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #45: <unknown function> + 0x1f7847 (0x55f420055847 in /home/vision2021_meas/anaconda3/envs/yolox/bin/python)
Traceback (most recent call last):
File "tools/train.py", line 127, in <module>
launch(
File "/home/vision2021_meas/mycfhs/yolox/yolox/core/launch.py", line 82, in launch
mp.start_processes(
File "/home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 5 terminated with signal SIGABRT
/home/vision2021_meas/anaconda3/envs/yolox/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 149 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
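The log itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`. Because CUDA kernels are launched asynchronously, the stack trace above (which ends in `CUDACachingAllocator::raw_delete`) may not point at the kernel that actually made the illegal access. Forcing synchronous launches usually gives a more accurate trace. Below is a minimal sketch of one way to do that from Python. It assumes training is started as `python tools/train.py` from the repo root. `train_args` and the helper names `debug_env` and `run_training` are hypothetical placeholders, not part of YOLOX. Pass whatever arguments the original run used.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    CUDA_LAUNCH_BLOCKING=1 makes every kernel launch block until it finishes,
    so an illegal memory access is reported at the call that caused it.
    It slows training down noticeably, so use it only while debugging.
    """
    env = dict(os.environ if base is None else base)  # copy, never mutate the input
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(train_args):
    """Run tools/train.py in a child process with the debug environment.

    train_args is a hypothetical placeholder for the original command-line
    arguments. The variable is set before the child starts, i.e. before any
    CUDA context exists, and worker processes started by the launcher
    inherit it from that child's environment.
    """
    cmd = [sys.executable, "tools/train.py", *train_args]
    return subprocess.run(cmd, env=debug_env(), check=False).returncode
```

The shell equivalent is simply prefixing the usual command, e.g. `CUDA_LAUNCH_BLOCKING=1 python tools/train.py ...` with the same arguments as before. Since the crash only shows up after tens of epochs, the slower synchronous run may take a long time to reach the failure point.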