yutianCHN 2023-02-01 21:12 Acceptance rate: 0%
Views: 105
Closed

AllenNLP multi-GPU training fails during gradient backpropagation

Problem:
2023-02-01 12:41:57,919 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/data/yutian/anaconda3/envs/py37/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/__main__.py", line 39, in run
    main(prog="allennlp")
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 120, in main
    args.func(args)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 120, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 186, in train_model_from_file
    return_model=return_model,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 341, in train_model
    nprocs=num_procs,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 508, in _train_worker
    metrics = train_loop.run()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 581, in run
    return self.trainer.train()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 771, in train
    metrics, epoch = self._try_train()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 510, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 403, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 221 222
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

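This is a gradient backpropagation problem. According to the error message, there are two ways to fix it: 1. track down the unused parameters; 2. set find_unused_parameters=True.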

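My attempts

1. Tracking down the unused parameters
Before running allennlp train, I set TORCH_DISTRIBUTED_DEBUG=DETAIL, which printed the names of the parameters that received no gradient: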

Parameters which did not receive grad for rank 0: word_embedder.token_embedder_bert.transformer_model.pooler.dense.weight, word_embedder.token_embedder_bert.transformer_model.pooler.dense.bias
Parameter indices which did not receive grad for rank 0: 221 222
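
For anyone repeating this step from a script, here is a minimal Python sketch. It sets TORCH_DISTRIBUTED_DEBUG for a child process and extracts the parameter names from a log line in the format shown above. The helper names (`ddp_debug_env`, `parse_unused_parameters`, `run_with_debug`) and the paths in the usage comment are hypothetical; they are not part of AllenNLP or PyTorch.

```python
import os
import re
import subprocess
from typing import Dict, List, Mapping, Optional

# Matches the "names" line only; the "Parameter indices ..." line does not match.
_UNUSED = re.compile(r"Parameters which did not receive grad for rank (\d+): (.*)")


def ddp_debug_env(
    base: Optional[Mapping[str, str]] = None, level: str = "DETAIL"
) -> Dict[str, str]:
    """Return a copy of the environment with TORCH_DISTRIBUTED_DEBUG set."""
    env = dict(os.environ if base is None else base)
    env["TORCH_DISTRIBUTED_DEBUG"] = level
    return env


def parse_unused_parameters(log_text: str) -> Dict[int, List[str]]:
    """Map each rank to the parameter names reported as receiving no gradient."""
    result: Dict[int, List[str]] = {}
    for line in log_text.splitlines():
        match = _UNUSED.search(line)
        if match:
            names = [n.strip() for n in match.group(2).split(",") if n.strip()]
            result[int(match.group(1))] = names
    return result


def run_with_debug(cmd: List[str]) -> Dict[int, List[str]]:
    """Run a training command with debug output enabled and parse its logs."""
    proc = subprocess.run(cmd, env=ddp_debug_env(), capture_output=True, text=True)
    return parse_unused_parameters(proc.stdout + proc.stderr)


# Usage (hypothetical config path and output directory):
# run_with_debug(["allennlp", "train", "config.jsonnet", "-s", "output_dir"])
```

The regular expression is written against the exact wording in the log above, so it may need adjusting if a different PyTorch version phrases the message differently.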

These two parameters take no part in backpropagation. However, I load this pretrained model purely through the AllenNLP config:

"bert": {
                    "type": "pretrained_transformer",
                    "model_name": "/data/yutian/.pytorch_pretrained_bert",
                    "last_layer_only": true,
                }

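I'm not sure where this should be changed.

2. Setting find_unused_parameters=True
However, I only use the allennlp train command; the only thing I did for distributed training was to add the following config: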

"distributed": {
        "cuda_devices": [0, 1, 2, 3]
    },

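I never call DataParallel by hand, and I don't know how to pass arguments to torch.nn.parallel.DistributedDataParallel (I'm still quite a beginner).

I hope someone who has run into something similar can offer some help!

Full config attached: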

{
    "random_seed": 42,
    "numpy_seed": 42,
    "pytorch_seed": 42,
    "dataset_reader": {
        "type": "rewrite",
        "lazy": false,
        "super_mode": "before",
        "joint_encoding": true,
        "use_bert": true,
        "language": "zh",
        "extra_stop_words": ["的", "是", "我", "了", "去"]
    },
    "model": {
        "type": "rewrite",
        "word_embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": "/data/yutian/.pytorch_pretrained_bert",
                    "last_layer_only": true,
                    // "requires_grad": true
                }
            },
            // "allow_unmatched_keys": true,
            // "embedder_to_indexer_map": {
            //     "bert": [
            //        "bert",
            //        "bert-offsets",
            //        "bert-type-ids"
            //    ]
            // }
        },
        "text_encoder": {
            "type": "pytorch_transformer",
            "input_dim": 1152,
            "num_layers": 2,
            "positional_encoding": "sinusoidal"
        },
        "inp_drop_rate": 0.2,
        "out_drop_rate": 0.2,
        "feature_sel": 83,
        "loss_weights": [0.2, 0.2, 0.6],
        "super_mode": "before",
        "unet_down_channel": 64
    },
    "data_loader": {
        "batch_size": 12,
        "shuffle": true,
        "cuda_device": 1
    },
    "trainer": {
        "run_confidence_checks": 0,
        "num_epochs": 100,
        "patience": 10,
        "validation_metric": "+F3",
        // "cuda_device": 0,
        "optimizer": {
            "type": "adam",
            "parameter_groups": [
                [
                    [
                        ".*word_embedder.*", "text_encoder"
                    ],
                    {
                        "lr": 1e-5
                    }
                ]
            ],
            "lr": 1e-3
        },
        "learning_rate_scheduler": {
            "type": "reduce_on_plateau",
            "factor": 0.5,
            "mode": "max",
            "patience": 5
        },
        "num_serialized_models_to_keep": 10,
        "should_log_learning_rate": true
    },
    "distributed": {
        "cuda_devices": [0, 1, 2, 3]
    },
}

3 answers

  • yutianCHN 2023-02-03 09:07

    I found the source of allennlp.nn.parallel.ddp_accelerator and changed the default value of find_unused_parameters to True; that fixed it.

    @DdpAccelerator.register("torch")
    class TorchDdpAccelerator(DdpAccelerator):
        def __init__(
            self,
            *,
            find_unused_parameters: bool = False,  # change to True
            local_rank: Optional[int] = None,
            world_size: Optional[int] = None,
            cuda_device: Union[torch.device, int] = -1,
        ) -> None:
            ...  # rest of the constructor left unchanged
    
    

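    I don't yet understand the call chain behind allennlp train. In theory there should be an interface for changing this parameter without modifying library code, but the problem is solved for now.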
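
One possible route that avoids patching the library, offered as an untested sketch: since TorchDdpAccelerator is registered under the name "torch", the config may be able to select it and pass find_unused_parameters directly. The helper below only edits a config dictionary. It assumes that the train command reads a `ddp_accelerator` entry inside the `distributed` section; that assumption is not confirmed anywhere in this thread and should be checked against the installed AllenNLP version. The function name `with_find_unused_parameters` is hypothetical.

```python
import copy
import json
from typing import Any, Dict


def with_find_unused_parameters(
    config: Dict[str, Any], enabled: bool = True
) -> Dict[str, Any]:
    """Return a copy of `config` that requests the "torch" DdpAccelerator
    with `find_unused_parameters` set.

    Assumption: the accelerator is built from `distributed.ddp_accelerator`.
    """
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    distributed = updated.setdefault("distributed", {})
    accelerator = distributed.setdefault("ddp_accelerator", {"type": "torch"})
    accelerator["find_unused_parameters"] = enabled
    return updated


# Only the "distributed" section from the question is shown here.
base = {"distributed": {"cuda_devices": [0, 1, 2, 3]}}
print(json.dumps(with_find_unused_parameters(base), indent=4))
```

If the assumption holds, the printed `distributed` section can be pasted into the training config in place of the original one, and the library source can stay unmodified.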


Question events

  • Closed by the system on February 9
  • Question created on February 1