2020-12-01 12:47

# Layer normalization

It would be nice to support some form of layer normalization in the LSTM and GRU layers (example: https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/custom_lstms.py#L171)


#### 5 replies

• weixin_39761655 4 months ago

Here's what the haste.LayerNormLSTM implementation looks like:

$\begin{pmatrix} \mathbf{f}_t \\ \mathbf{i}_t \\ \mathbf{o}_t \\ \mathbf{g}_t \end{pmatrix} = \mathrm{LN}(\mathbf{W}_h \mathbf{h}_{t-1}; \boldsymbol{\gamma}_1, \boldsymbol{0}) + \mathrm{LN}(\mathbf{W}_x \mathbf{x}_t; \boldsymbol{\gamma}_2, \boldsymbol{0}) + \mathbf{b}$

$\mathbf{c}_t = \sigma(\mathbf{f}_t) \odot \mathbf{c}_{t-1} + \sigma(\mathbf{i}_t) \odot \tanh(\mathbf{g}_t)$

$\mathbf{h}_t = \sigma(\mathbf{o}_t) \odot \tanh(\mathrm{LN}(\mathbf{c}_t; \boldsymbol{\gamma}_3, \boldsymbol{\beta}_3))$

This implementation is nearly identical to eqs. 20–22 of the layer norm paper. The differences are:

1. We don't apply a bias term to the layer norms on the input or recurrent connection; these parameters are unnecessary since there's already a bias term ($+ \mathbf{b}$) applied by the LSTM.
2. We use $\gamma$ instead of $\alpha$ to denote the gain parameter (notation change).
3. We initialize $\gamma$ to 1 and $\beta$ to 0 instead of the other way around (seems like a typo in the paper).
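For reference, here's a minimal NumPy sketch of a single step under the equations above. This is just an illustration of the math, not haste's actual (CUDA) implementation, and all function and parameter names here are made up for the example; the gate order f, i, o, g follows the stacked vector in the first equation.

```python
import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    # Normalize over the feature axis, then apply gain and bias.
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_norm_lstm_step(x, h_prev, c_prev, Wx, Wh, b,
                         gamma1, gamma2, gamma3, beta3):
    """One step of the layer-normalized LSTM described above.

    Wx: (4*hidden, input), Wh: (4*hidden, hidden), b: (4*hidden,).
    The recurrent and input layer norms use gain only (beta fixed at 0),
    matching difference 1 in the list above.
    """
    z = (layer_norm(Wh @ h_prev, gamma1, 0.0)
         + layer_norm(Wx @ x, gamma2, 0.0)
         + b)
    f, i, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(layer_norm(c, gamma3, beta3))
    return h, c
```

Initializing `gamma1`, `gamma2`, `gamma3` to ones and `beta3` to zeros reproduces difference 3.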

I haven't gotten around to updating the docs yet, but haste.LSTM can just be replaced with haste.LayerNormLSTM. Zoneout, DropConnect, etc. are all supported in LayerNormLSTM as well.

• weixin_39738115 4 months ago

Nice! Having a GRU would also be great, but we can probably manage with LSTMs :)

• weixin_39761655 4 months ago

Our LSTM implementation is much further along than the GRU one, so we started with LSTMs first. When we do the GRU updates, we'll keep LayerNorm in mind. Thanks for the feature request!

• weixin_39761655 4 months ago

Hmm, that's an interesting implementation. They're applying layer norm to $\mathbf{c}_t$ in addition to $\mathbf{h}_t$. The supplementary material in Ba et al. (pp. 13–14) only applies layer norm to $\mathbf{h}_t$ in both of their LSTM variants.

Do you know if there's any follow-up literature that explains the PyTorch variant?

• weixin_39738115 4 months ago

I don't know of any. I personally think that any variant of GRU/LSTM with LayerNorm would be a great addition.
