Single-machine multi-GPU training: RuntimeError: Expected all tensors to be on the same device, but found at least two devices

When I try to fine-tune on a single machine with multiple GPUs using DeepSpeed, I get this error:

Traceback (most recent call last):
  File "main.py", line 440, in <module>
    main()
  File "main.py", line 397, in main
    perplexity = evaluation(model, eval_dataloader)
  File "main.py", line 323, in evaluation
    outputs = model(**batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1695, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 730, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2213, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

The other GPU reports the same error, except that the devices are:

cuda:1 and cpu!

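I have been digging into this for two days. I understand that part of the data is on the CPU and part is on the GPU, but which part is on the CPU, and how do I move it over?

In torch/nn/functional.py, where the error is finally raised, I tried printing whether input and weight are on the CPU or the GPU: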

print(input.untyped_storage())
print(weight.untyped_storage())

input shows: [torch.storage.UntypedStorage(device=cpu) of size 4096]
weight shows: CUDA error: an illegal memory access was encountered

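So it looks as if input is on the CPU.
But when I try to move it to the GPU with input.cuda(), I get another error: CUDA error: an illegal memory access was encountered
So is it actually on the CPU or on the GPU?

Could anyone tell me how to track down which step is causing this?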


Answer by cyctlove, 2023-05-12 23:30
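This error message means that some of your tensors are not on the same device. Specifically, your input data (input_ids and attention_mask) is on the CPU while the model parameters are on the GPU. The likely cause is that the device for the data and the model was not set correctly when the data was loaded.

To resolve this, you can try the following:

After loading a batch and before passing it to the model, move the data to the appropriate device, for example: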

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

Here, device is the device you want to move the data to; you can specify it with torch.device().
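When the batch is a dictionary (as in `model(**batch)` in the traceback), moving each field by hand is easy to get wrong. The following is a minimal sketch of a helper that moves every tensor-like value in one call. The name `move_to_device` is hypothetical, not part of PyTorch or DeepSpeed, and the helper assumes that anything exposing a `.to()` method (such as `torch.Tensor`) should be moved.

```python
from typing import Any


def move_to_device(obj: Any, device: Any) -> Any:
    """Recursively move every tensor-like value in obj onto device.

    Hypothetical helper: anything exposing a .to() method (for example
    torch.Tensor) is moved; dicts, lists and tuples are rebuilt with their
    contents moved; every other value is returned unchanged.
    """
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(item, device) for item in obj)
    return obj
```

In an evaluation loop like the one in the traceback, the call would be `batch = move_to_device(batch, device)` immediately before `outputs = model(**batch)`.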

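Check that your data loading and distributed training settings are correct. If your code uses distributed training, you need to pass the correct distributed and device arguments when initializing the trainer (such as DeepSpeed or PyTorch Lightning).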
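In a multi-process launch, each process normally drives one GPU, which would explain why one process reports cuda:0 and the other cuda:1. Below is a small sketch of deriving the per-process device name. It assumes the launcher exports a `LOCAL_RANK` environment variable for each process; if your launcher passes the rank another way, read it from there instead. The function name `local_device_name` is hypothetical.

```python
import os
from typing import Mapping, Optional


def local_device_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the device string for this process, e.g. "cuda:1".

    Assumption: the launcher exported LOCAL_RANK for each process.
    If the variable is missing, this sketch falls back to rank 0.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"
```

The result can be passed to `torch.device(...)` and then used as the `device` for every batch in that process.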

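When a CUDA error occurs, you can add debugging statements to pinpoint where it happens. Print intermediate results before and after the failing call to find where the error first appears. In your example, you could print the device of each input just before the forward call: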

for key in batch:
    print(f"{key} device: {batch[key].device}")
outputs = model(**batch)
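Printing only the batch does not show whether the model itself is fully on the GPU. The sketch below records the device of every batch entry, parameter and buffer, then filters for anything not on the expected device. The function names are hypothetical. It relies only on `named_parameters()` and `named_buffers()`, which `torch.nn.Module` provides; whether the reported devices are meaningful under every DeepSpeed configuration (for example with partitioned or offloaded parameters) is an assumption you should verify.

```python
from typing import Any, Dict, Mapping


def collect_devices(batch: Mapping[str, Any], model: Any) -> Dict[str, str]:
    """Map each batch entry, parameter and buffer name to its device string."""
    devices: Dict[str, str] = {}
    for key, value in batch.items():
        # Non-tensor entries have no .device attribute; record that explicitly.
        devices[f"batch.{key}"] = str(getattr(value, "device", "not a tensor"))
    for name, param in model.named_parameters():
        devices[f"param.{name}"] = str(param.device)
    for name, buf in model.named_buffers():
        devices[f"buffer.{name}"] = str(buf.device)
    return devices


def off_device(devices: Mapping[str, str], expected: str) -> Dict[str, str]:
    """Return only the entries whose device differs from expected."""
    return {name: dev for name, dev in devices.items() if dev != expected}
```

Calling `off_device(collect_devices(batch, model), "cuda:0")` right before the forward pass on the first process lists the entries that would trigger the mismatch.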

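Finally, check whether there are other device operations, such as training a model on the GPU while using the CPU for other tasks. If so, this may lead to resource shortages.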

I hope these methods help you solve the problem. (AI-generated answer)
Question events

Question created on May 12
Question closed on May 16
