I'm trying to fine-tune on a single machine with multiple GPUs using deepspeed, and I get the following error:
Traceback (most recent call last):
File "main.py", line 440, in <module>
main()
File "main.py", line 397, in main
perplexity = evaluation(model, eval_dataloader)
File "main.py", line 323, in evaluation
outputs = model(**batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1695, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 730, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2213, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
The other GPU reports the same error, except that the devices are:
cuda:1 and cpu!
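I've been looking into this for two days. I understand the cause is that some tensors are on the CPU and some are on the GPU, but which ones are on the CPU, and how do I move them over?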
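Since the traceback goes through evaluation() and `outputs = model(**batch)`, one thing worth checking is whether each batch is moved to the GPU before the forward pass. Below is a minimal sketch, assuming the batch is a dict of tensors (as `model(**batch)` suggests). The names `move_batch_to_device` and `local_rank` are hypothetical and do not come from main.py or deepspeed.

```python
def move_batch_to_device(batch, device):
    """Return a new dict with every tensor-like value of `batch` moved to `device`.

    Hypothetical helper for illustration; not part of main.py or deepspeed.
    """
    moved = {}
    for key, value in batch.items():
        # torch.Tensor provides .to(device); values without .to are kept as-is.
        moved[key] = value.to(device) if hasattr(value, "to") else value
    return moved


# Assumed usage inside evaluation(), before the forward pass:
#   device = torch.device("cuda", local_rank)  # local_rank: this process's GPU index
#   for batch in eval_dataloader:
#       batch = move_batch_to_device(batch, device)
#       outputs = model(**batch)
```

The helper returns a new dict rather than modifying the batch in place, so the original CPU batch stays available for inspection.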
In torch/nn/functional.py, where the error is finally raised, I tried printing whether input and weight are on the CPU or the GPU:
print(input.untyped_storage())
print(weight.untyped_storage())
input shows: [torch.storage.UntypedStorage(device=cpu) of size 4096]
weight shows: CUDA error: an illegal memory access was encountered
So it looks like input is on the CPU.
But when I try to move it to the GPU with input.cuda(), I get another error: CUDA error: an illegal memory access was encountered
So is it actually on the CPU or on the GPU?
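A lighter check than printing the storage is to read each tensor's `.device` attribute, which only reads metadata and does not copy any tensor data. A minimal sketch, assuming the values are torch tensors; `describe_devices` is a hypothetical helper name:

```python
def describe_devices(named_values):
    """Map each name to str(value.device), or a marker for non-tensor values.

    Hypothetical helper for illustration; it relies only on the `.device`
    attribute that torch.Tensor exposes.
    """
    report = {}
    for name, value in named_values.items():
        device = getattr(value, "device", None)
        report[name] = str(device) if device is not None else "no .device attribute"
    return report


# Assumed usage just before `outputs = model(**batch)` in evaluation():
#   print(describe_devices(batch))
# or, inside F.embedding:
#   print(describe_devices({"input": input, "weight": weight}))
```

CUDA errors are reported asynchronously, so an "illegal memory access" may originate from an earlier call than the one that raises it. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 may make the traceback point closer to the failing call.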
Could anyone tell me how to track down which step is causing this?
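To narrow down whether the model itself has anything left on the CPU, one option is to group its parameters and buffers by device. A minimal sketch, assuming `model` exposes `named_parameters()` and `named_buffers()` as `torch.nn.Module` does; `group_by_device` is a hypothetical helper name:

```python
from collections import defaultdict
from itertools import chain


def group_by_device(model):
    """Return {device string: [parameter and buffer names on that device]}.

    Hypothetical helper for illustration; works with any object that offers
    named_parameters() and named_buffers() like torch.nn.Module.
    """
    groups = defaultdict(list)
    for name, tensor in chain(model.named_parameters(), model.named_buffers()):
        groups[str(tensor.device)].append(name)
    return dict(groups)


# Assumed usage after the model has been wrapped by deepspeed:
#   for device, names in group_by_device(model).items():
#       print(device, len(names), names[:5])
```

If every name ends up under a cuda device, the CPU tensor is more likely to be one of the inputs in the batch than a model weight.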