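How do I troubleshoot an overrun in nccl-tests?
Details: at run time the ring algorithm works normally, but with the tree algorithm, data smaller than 1M overruns, that is, the run hangs.
The error output is as follows: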
[worker03:29830] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[worker03:29830] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[worker03:29830] 7 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[worker03:29830] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
worker03: Test CUDA failure all_reduce.cu:36 'unknown error'
.. worker03 pid 29866: Test failure common.cu:589
.. worker03 pid 29866: Test failure common.cu:711
.. worker03 pid 29866: Test failure all_reduce.cu:101
.. worker03 pid 29866: Test failure common.cu:725
.. worker03 pid 29866: Test failure common.cu:1166
.. worker03 pid 29866: Test failure common.cu:964
worker03: Test CUDA failure all_reduce.cu:36 'unknown error'
.. worker03 pid 29865: Test failure common.cu:589
.. worker03 pid 29865: Test failure common.cu:711
.. worker03 pid 29865: Test failure all_reduce.cu:101
.. worker03 pid 29865: Test failure common.cu:725
.. worker03 pid 29865: Test failure common.cu:1166
.. worker03 pid 29865: Test failure common.cu:964
worker03: Test CUDA failure all_reduce.cu:36 'unknown error'
.. worker03 pid 29864: Test failure common.cu:589
.. worker03 pid 29864: Test failure common.cu:711
.. worker03 pid 29864: Test failure all_reduce.cu:101
.. worker03 pid 29864: Test failure common.cu:725
.. worker03 pid 29864: Test failure common.cu:1166
.. worker03 pid 29864: Test failure common.cu:964
worker03:29870:29870 [5] NCCL INFO ncclEnqueueCheck isAsync:0
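No root-cause analysis is needed. Please tell me which methods I should use to troubleshoot, for example which files I should add comments and breakpoints to.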
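
For reference, the ring vs. tree comparison described above could be reproduced with something like the sketch below. It only builds the command line and environment for each run and does not launch anything. The binary path `./build/all_reduce_perf`, the size range, and the helper name `build_repro` are assumptions for illustration, not taken from the failing setup. The MPI launcher line is left out on purpose. `NCCL_ALGO`, `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS` and `CUDA_LAUNCH_BLOCKING` are existing NCCL/CUDA environment variables, and `-b`, `-e`, `-f`, `-g` are existing nccl-tests options.

```python
import os
import shlex


def build_repro(binary="./build/all_reduce_perf", algo="Tree",
                min_bytes="8", max_bytes="1M", gpus_per_proc=1):
    """Return (argv, env) for one nccl-tests run with the algorithm pinned.

    All defaults are hypothetical; adjust them to the real cluster layout.
    """
    env = dict(os.environ)
    env.update({
        "NCCL_ALGO": algo,                 # pin Ring or Tree
        "NCCL_DEBUG": "INFO",              # verbose NCCL logging
        "NCCL_DEBUG_SUBSYS": "INIT,COLL",  # restrict logs to init and collectives
        "CUDA_LAUNCH_BLOCKING": "1",       # report CUDA errors at the failing launch
    })
    argv = [
        binary,
        "-b", min_bytes,           # smallest message size
        "-e", max_bytes,           # largest message size (the failing range is below 1M)
        "-f", "2",                 # multiply the size by 2 at each step
        "-g", str(gpus_per_proc),  # GPUs per process
    ]
    return argv, env


if __name__ == "__main__":
    # Print both variants so the two logs can be compared side by side.
    for algo in ("Ring", "Tree"):
        argv, env = build_repro(algo=algo)
        print(f"NCCL_ALGO={env['NCCL_ALGO']}", shlex.join(argv))
```

When the test is started through Open MPI, these variables would also need to be forwarded to the remote ranks, for example with `-x NCCL_ALGO`.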