weixin_39738115
2020-12-01 12:47

Layer normalization

It would be nice to support some form of layer normalization in the LSTM and GRU layers (for example: https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/custom_lstms.py#L171)

This question originates from the open-source project: lmnt-com/haste


5 replies

  • weixin_39761655 4 months ago

    Here's what the haste.LayerNormLSTM implementation looks like:

    This implementation is nearly identical to eqs. 20–22 of the layer norm paper. The differences are:

    1. we don't apply a bias term to the layer norms on the input or recurrent connections; these parameters are unnecessary since there's already a bias term (... + b) applied by the LSTM
    2. we use a different symbol to denote the gain parameter (notation change only)
    3. we initialize the gain to 1 and the bias to 0 instead of the other way around (which seems to be a typo in the paper)
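
    A minimal sketch of a cell update along these lines, assuming standard PyTorch ops (this is illustrative only, not haste's actual CUDA implementation; names like `gamma_x` are mine): gain-only layer norms on the input and recurrent projections, a single LSTM bias `b`, and a cell-state layer norm whose gain starts at 1 and bias at 0.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LayerNormLSTMCellSketch(nn.Module):
        """Illustrative layer-normalized LSTM cell; not haste's actual kernel code."""

        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.hidden_size = hidden_size
            self.W_x = nn.Parameter(torch.empty(4 * hidden_size, input_size))
            self.W_h = nn.Parameter(torch.empty(4 * hidden_size, hidden_size))
            self.b = nn.Parameter(torch.zeros(4 * hidden_size))   # single LSTM bias
            # Gain-only layer norms on the input/recurrent projections (no LN bias).
            self.gamma_x = nn.Parameter(torch.ones(4 * hidden_size))
            self.gamma_h = nn.Parameter(torch.ones(4 * hidden_size))
            # Cell-state layer norm: gain initialized to 1, bias to 0.
            self.gamma_c = nn.Parameter(torch.ones(hidden_size))
            self.beta_c = nn.Parameter(torch.zeros(hidden_size))
            nn.init.xavier_uniform_(self.W_x)
            nn.init.orthogonal_(self.W_h)

        def forward(self, x, state):
            h, c = state
            # Normalize the input and recurrent contributions separately, then add the shared bias.
            gates = (F.layer_norm(x @ self.W_x.t(), (4 * self.hidden_size,), self.gamma_x)
                     + F.layer_norm(h @ self.W_h.t(), (4 * self.hidden_size,), self.gamma_h)
                     + self.b)
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(
                F.layer_norm(c, (self.hidden_size,), self.gamma_c, self.beta_c))
            return h, (h, c)
    ```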

    I haven't gotten around to updating the docs yet, but haste.LSTM can just be replaced with haste.LayerNormLSTM. Zoneout, DropConnect, etc. are all supported in LayerNormLSTM as well.
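
    For reference, a usage sketch, assuming the PyTorch bindings are imported as `haste_pytorch` and that `LayerNormLSTM` takes the same constructor arguments as `haste.LSTM` (check the repo's README for the exact API):

    ```python
    import torch
    import haste_pytorch as haste  # assumed import name for the PyTorch bindings

    # Previously: rnn = haste.LSTM(input_size=128, hidden_size=256, zoneout=0.1)
    rnn = haste.LayerNormLSTM(input_size=128, hidden_size=256, zoneout=0.1).cuda()

    x = torch.rand(50, 32, 128, device='cuda')  # [time, batch, features]
    y, state = rnn(x)
    ```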

  • weixin_39738115 4 months ago

    Nice! Having a GRU version would also be great, but we can probably manage with LSTMs :)

  • weixin_39761655 4 months ago

    Our LSTM implementation is much further along than the GRU one, so we started with LSTMs first. When we do the GRU updates, we'll keep LayerNorm in mind. Thanks for the feature request!

  • weixin_39761655 4 months ago

    Hmm that's an interesting implementation. They're applying layer norm to c_t in addition to h_t. The supplementary material in Ba et al. (pp. 13–14) only applies layer norm to h_t in both of their LSTM variants.
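
    To make the comparison concrete, here is a hedged sketch of the two cell updates as I read them; `f`, `i`, `g`, `o` are assumed to be the already-activated gates and `ln_c` a layer norm over the cell state:

    ```python
    import torch
    import torch.nn as nn

    def cell_update_pytorch_benchmark_style(f, i, g, o, c_prev, ln_c: nn.LayerNorm):
        """custom_lstms.py style: the *stored* cell state is layer-normalized,
        so the normalization affects both c_t and h_t."""
        c_new = ln_c(f * c_prev + i * g)
        h_new = o * torch.tanh(c_new)
        return h_new, c_new

    def cell_update_ba_et_al_style(f, i, g, o, c_prev, ln_c: nn.LayerNorm):
        """Ba et al. style as described above: c_t stays unnormalized and
        layer norm only enters the computation of h_t."""
        c_new = f * c_prev + i * g
        h_new = o * torch.tanh(ln_c(c_new))
        return h_new, c_new
    ```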

    Do you know if there's any follow-up literature that explains the PyTorch variant?

  • weixin_39738115 4 months ago

    I don't know of any. I personally think that any variant of GRU/LSTM with LayerNorm would be a great addition.
