When training an nn.RNN model with PyTorch, backpropagation raises an error. The code reduces to the following (Python 3.9, torch 1.13.1):
```python
import torch
rnn = torch.nn.RNN(input_size=1, hidden_size=1, num_layers=1)
train_set_x = torch.tensor([[[1]],[[2]],[[3]],[[4]],[[5]]], dtype=torch.float32)
train_set_y = torch.tensor([[[2]],[[4]],[[6]],[[8]],[[10]]], dtype=torch.float32)
h0 = torch.tensor([[0]], dtype=torch.float32)
h_cur = h0
loss = torch.nn.MSELoss()
opt = torch.optim.Adadelta(rnn.parameters(), lr = 0.01)
for i in range(5):
    opt.zero_grad()
    train_output, h_next = rnn(train_set_x[i], h_cur)
    rnn_loss = loss(train_output, train_set_y[i])
    rnn_loss.backward()
    opt.step()
    print(train_output)
    h_cur = h_next  # carry the hidden state into the next iteration
```
The error message:
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
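If I read the error correctly, the hidden state returned by rnn() still has a grad_fn, i.e. it is attached to the current iteration's autograd graph, so carrying it over as h_cur chains the iterations' graphs together, and the next backward() re-enters the part that the previous backward() already freed. A minimal, self-contained probe of this (variable names here are just for illustration):

```python
import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=1, num_layers=1)
x = torch.ones(1, 1)    # one step: (seq_len=1, input_size=1)
h0 = torch.zeros(1, 1)  # (num_layers=1, hidden_size=1)

out, h_next = rnn(x, h0)
# h_next is still part of this forward pass's graph; reusing it as the
# next h_cur makes the following backward() traverse this graph as well.
print(h_next.grad_fn)   # not None
```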
Following the hint, I modified the code:
```python
import torch
rnn = torch.nn.RNN(input_size=1, hidden_size=1, num_layers=1)
train_set_x = torch.tensor([[[1]],[[2]],[[3]],[[4]],[[5]]], dtype=torch.float32)
train_set_y = torch.tensor([[[2]],[[4]],[[6]],[[8]],[[10]]], dtype=torch.float32)
h0 = torch.tensor([[0]], dtype=torch.float32)
h_cur = h0
loss = torch.nn.MSELoss()
opt = torch.optim.Adadelta(rnn.parameters(), lr=0.01)
with torch.autograd.set_detect_anomaly(True):
    for i in range(5):
        opt.zero_grad()
        train_output, h_next = rnn(train_set_x[i], h_cur)
        rnn_loss = loss(train_output, train_set_y[i])
        rnn_loss.backward(retain_graph=True)
        opt.step()
        print(train_output)
        h_cur = h_next
```
It still fails:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
What puzzles me is that the message blames an in-place operation, but I never used anything like `+=` anywhere.
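As far as I can tell, the in-place modification does not have to come from my own code: opt.step() updates every parameter tensor in place, which bumps autograd's internal version counter. A small diagnostic sketch (it relies on the undocumented Tensor._version attribute, so take it as illustration only):

```python
import torch

lin = torch.nn.Linear(1, 1)
opt = torch.optim.Adadelta(lin.parameters(), lr=0.01)
lin(torch.ones(1, 1)).sum().backward()

w = lin.weight
before = w._version
opt.step()                  # in-place parameter update
print(before, w._version)   # the version counter increases
```

With retain_graph=True, the older graphs stay alive and keep references to the parameter tensors that opt.step() mutates on every iteration, which looks like the kind of version mismatch the error describes.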
Even more puzzling: judging by the print output, the first two training iterations run fine, and the error only shows up in the third:
```
tensor([[0.1129]], grad_fn=<SqueezeBackward1>)
tensor([[-0.1872]], grad_fn=<SqueezeBackward1>)
C:\Users\Lenovo\Desktop\DL\LSTM_poem\lib\site-packages\torch\autograd\__init__.py:197: UserWarning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:
  File "C:\Users\Lenovo\Desktop\DL\LSTM_poem\test.py", line 17, in <module>
    train_output, h_next = rnn(train_set_x[i], h_cur)
.... (rest of the traceback omitted)
```
Many posts on CSDN mention this problem, but none of the fixes I tried there worked.
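For reference, the direction most answers point to is detaching the hidden state before carrying it over, so that each iteration's backward() stays inside its own one-step graph (truncated backpropagation through time). A sketch of my loop with that change; I would still like to understand why the original version fails only at the third iteration:

```python
import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=1, num_layers=1)
train_set_x = torch.tensor([[[1]],[[2]],[[3]],[[4]],[[5]]], dtype=torch.float32)
train_set_y = torch.tensor([[[2]],[[4]],[[6]],[[8]],[[10]]], dtype=torch.float32)
h_cur = torch.zeros(1, 1)
loss = torch.nn.MSELoss()
opt = torch.optim.Adadelta(rnn.parameters(), lr=0.01)

for i in range(5):
    opt.zero_grad()
    train_output, h_next = rnn(train_set_x[i], h_cur)
    rnn_loss = loss(train_output, train_set_y[i])
    rnn_loss.backward()           # no retain_graph needed once graphs are separate
    opt.step()
    print(train_output)
    h_cur = h_next.detach()       # cut the graph before the next iteration
```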