ALLENNLP multi-GPU training: error during gradient backpropagation

Problem:

2023-02-01 12:41:57,919 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/data/yutian/anaconda3/envs/py37/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/__main__.py", line 39, in run
    main(prog="allennlp")
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 120, in main
    args.func(args)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 120, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 186, in train_model_from_file
    return_model=return_model,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 341, in train_model
    nprocs=num_procs,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 508, in _train_worker
    metrics = train_loop.run()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 581, in run
    return self.trainer.train()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 771, in train
    metrics, epoch = self._try_train()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 510, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 403, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 221 222
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

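This is a gradient backpropagation problem. According to the error message, there are two possible fixes: 1. track down the unused parameters; 2. set find_unused_parameters=True.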

What I tried

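1 Tracking down the unused parameters
Before running allennlp train, I set TORCH_DISTRIBUTED_DEBUG=DETAIL, which printed the names of the parameters that did not receive gradients: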

Parameters which did not receive grad for rank 0: word_embedder.token_embedder_bert.transformer_model.pooler.dense.weight, word_embedder.token_embedder_bert.transformer_model.pooler.dense.bias
Parameter indices which did not receive grad for rank 0: 221 222
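The debug run described above can be reproduced with a small launcher. This is only a sketch: the function name build_debug_launch and the example paths are hypothetical, and it assumes the allennlp executable is on PATH. It builds the allennlp train command and a copy of the environment with TORCH_DISTRIBUTED_DEBUG set, so that the spawned worker processes inherit the variable.

```python
import os
from typing import Dict, List, Tuple


def build_debug_launch(config_path: str, serialization_dir: str) -> Tuple[List[str], Dict[str, str]]:
    """Return the command and environment for a DDP debug run.

    build_debug_launch is a hypothetical helper, not part of AllenNLP.
    """
    # Copy the current environment and add the PyTorch debug switch.
    env = dict(os.environ, TORCH_DISTRIBUTED_DEBUG="DETAIL")
    # `allennlp train <config> -s <dir>` is the usual training invocation.
    cmd = ["allennlp", "train", config_path, "-s", serialization_dir]
    return cmd, env


# Usage (paths are placeholders):
#   import subprocess
#   cmd, env = build_debug_launch("config.jsonnet", "output/")
#   subprocess.run(cmd, env=env, check=True)
```

Setting the variable in the shell before calling allennlp train has the same effect; the launcher is just a way to keep the setting next to the command.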

These two parameters take no part in backpropagation. However, I load this pretrained model purely through the allennlp config:

"bert": {
                    "type": "pretrained_transformer",
                    "model_name": "/data/yutian/.pytorch_pretrained_bert",
                    "last_layer_only": true,
                }

I am not sure where this should be changed.
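One possible way to handle the two pooler parameters, sketched under assumptions: DistributedDataParallel only tracks parameters that have requires_grad=True when the model is wrapped, so turning gradients off for the pooler before training starts should keep it out of the unused-parameter check. The helper name freeze_matching and the suggestion to call it at the end of the custom model's __init__ are assumptions for illustration, not an AllenNLP API.

```python
import re
from typing import Any, Iterable, List, Tuple


def freeze_matching(
    named_parameters: Iterable[Tuple[str, Any]],
    pattern: str = r"\.pooler\.",
) -> List[str]:
    """Disable gradients for every parameter whose name matches `pattern`.

    `named_parameters` is expected to be what torch.nn.Module.named_parameters()
    yields: (name, parameter) pairs. freeze_matching is a hypothetical helper.
    """
    frozen = []
    for name, param in named_parameters:
        if re.search(pattern, name):
            # requires_grad_ is the in-place switch on torch tensors/parameters.
            param.requires_grad_(False)
            frozen.append(name)
    return frozen


# Usage, assumed to run at the end of the custom model's __init__,
# i.e. before the trainer wraps the model for distributed training:
#   frozen = freeze_matching(self.named_parameters())
#   # expected to list the two pooler.dense parameters from the debug output
```

With the default pattern, only names containing ".pooler." are touched, so the rest of BERT stays trainable.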
2 Setting find_unused_parameters=True
However, I only use the allennlp train command here; the distributed setup consists solely of adding the following config:

"distributed": {
        "cuda_devices": [0, 1, 2, 3]
    },

I never call DataParallel by hand, and I don't know how to pass arguments to torch.nn.parallel.DistributedDataParallel (I'm a real beginner).

I hope someone who has run into something similar can offer some help!

Full config attached

{
    "random_seed": 42,
    "numpy_seed": 42,
    "pytorch_seed": 42,
    "dataset_reader": {
        "type": "rewrite",
        "lazy": false,
        "super_mode": "before",
        "joint_encoding": true,
        "use_bert": true,
        "language": "zh",
        "extra_stop_words": ["的", "是", "我", "了", "去"]
    },
    "model": {
        "type": "rewrite",
        "word_embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": "/data/yutian/.pytorch_pretrained_bert",
                    "last_layer_only": true,
                    // "requires_grad": true
                }
            },
            // "allow_unmatched_keys": true,
            // "embedder_to_indexer_map": {
            //     "bert": [
            //        "bert",
            //        "bert-offsets",
            //        "bert-type-ids"
            //    ]
            // }
        },
        "text_encoder": {
            "type": "pytorch_transformer",
            "input_dim": 1152,
            "num_layers": 2,
            "positional_encoding": "sinusoidal"
        },
        "inp_drop_rate": 0.2,
        "out_drop_rate": 0.2,
        "feature_sel": 83,
        "loss_weights": [0.2, 0.2, 0.6],
        "super_mode": "before",
        "unet_down_channel": 64
    },
    "data_loader": {
        "batch_size": 12,
        "shuffle": true,
        "cuda_device": 1
    },
    "trainer": {
        "run_confidence_checks": 0,
        "num_epochs": 100,
        "patience": 10,
        "validation_metric": "+F3",
        // "cuda_device": 0,
        "optimizer": {
            "type": "adam",
            "parameter_groups": [
                [
                    [
                        ".*word_embedder.*", "text_encoder"
                    ],
                    {
                        "lr": 1e-5
                    }
                ]
            ],
            "lr": 1e-3
        },
        "learning_rate_scheduler": {
            "type": "reduce_on_plateau",
            "factor": 0.5,
            "mode": "max",
            "patience": 5
        },
        "num_serialized_models_to_keep": 10,
        "should_log_learning_rate": true
    },
    "distributed": {
        "cuda_devices": [0, 1, 2, 3]
    },
}

3 answers

yutianCHN 2023-02-03 09:07
Find the source of allennlp.nn.parallel.ddp_accelerator and change the default value of find_unused_parameters to True.

@DdpAccelerator.register("torch")
class TorchDdpAccelerator(DdpAccelerator):
    def __init__(
        self,
        *,
        find_unused_parameters: bool = False,  # change to True
        local_rank: Optional[int] = None,
        world_size: Optional[int] = None,
        cuda_device: Union[torch.device, int] = -1,
    ) -> None:

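I don't yet understand the call path of allennlp train well; in theory there should be an interface for changing this parameter without editing library code. For now, though, the problem is solved.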
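A config-only route may exist, though this is an unverified sketch. It assumes that the installed AllenNLP version builds the accelerator from a "ddp_accelerator" entry inside the "distributed" section; that key and its placement are assumptions and should be checked against allennlp/commands/train.py in the local install. The "torch" type and the find_unused_parameters argument come from the class definition quoted above. The script only prints the fragment to merge into the training config.

```python
import json

# ASSUMPTION: AllenNLP reads "ddp_accelerator" from inside "distributed".
# Verify this against your installed version before relying on it.
distributed = {
    "cuda_devices": [0, 1, 2, 3],
    "ddp_accelerator": {
        "type": "torch",  # name used in @DdpAccelerator.register("torch")
        "find_unused_parameters": True,
    },
}

# Print the fragment as JSON so it can be pasted into the config file.
fragment = json.dumps({"distributed": distributed}, indent=4)
print(fragment)
```

If the key is accepted, the library source stays untouched and the setting travels with the experiment config rather than with the environment.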

Question events

Closed by the system on February 9
Question created on February 1
