I've been debugging MoCo v2 code recently, and the distributed-training part keeps failing at runtime with the error below. Could anyone help me figure out where the problem is?
Traceback (most recent call last):
  File "main_moco.py", line 530, in <module>
    main()
  File "main_moco.py", line 223, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/root/miniconda3/envs/xzb/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/root/miniconda3/envs/xzb/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/xzb/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/root/fengyong/xzb/main_moco.py", line 254, in main_worker
    rank=args.rank
  File "/root/miniconda3/envs/xzb/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/root/miniconda3/envs/xzb/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 88, in _tcp_rendezvous_handler
    raise _error("rank parameter missing")
ValueError: Error initializing torch.distributed using tcp:// rendezvous: rank parameter missing
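For context on the error: the tcp:// rendezvous only succeeds when init_process_group receives a non-negative rank (and a world_size), either as keyword arguments or as query parameters in the init URL. In the reference MoCo script, each spawned worker derives its global rank from the node rank on the command line, so leaving --rank at its -1 default propagates an invalid rank into init_process_group and triggers exactly this ValueError. A minimal sketch of the expected arithmetic (argument names follow main_moco.py; the concrete values are illustrative assumptions, not taken from the post):

```python
# Sketch of how main_moco.py computes the per-process global rank before it
# calls torch.distributed.init_process_group. If node_rank is left at the
# -1 default, the resulting rank is negative and the tcp:// rendezvous
# fails with "rank parameter missing".

def global_rank(node_rank: int, ngpus_per_node: int, gpu: int) -> int:
    """Global rank of the process driving local GPU `gpu` on node `node_rank`."""
    return node_rank * ngpus_per_node + gpu

# Single-node job with 4 GPUs: pass --rank 0 (the node's index, not the GPU's).
ngpus_per_node = 4
node_rank = 0
ranks = [global_rank(node_rank, ngpus_per_node, gpu) for gpu in range(ngpus_per_node)]
print(ranks)  # each spawned worker gets a distinct non-negative global rank
```

So the first things to check are that the launch command passes --rank 0 (for a single machine) and --world-size 1, and that the computed rank reaching init_process_group is non-negative.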