weixin_39612228
2020-12-06 23:23

Does the LSTM adapt the size of the output char sequence during training for a given utterance?

Dear Friends,

Is the size of the model's output (the sequence of characters) a deterministic function of the number of frames in the utterance?

Does the LSTM adapt the size of the output char sequence during training for a given utterance (same number of frames)?

Or, for a given number of frames in the utterance, will the model always predict an output sequence with the same number of characters?

Thanks for the answer,

David

This question comes from the open-source project: SeanNaren/deepspeech.pytorch

4 answers

  • weixin_39609170 (5 months ago)

    Yes, this is based on the convolutional net and the parameters you choose for the spectrogram. Currently, one second of audio creates 100 timesteps, which after DeepSpeech's convolutions are reduced to 50 timesteps on which character predictions are made (see the frame-arithmetic sketch after this thread).

  • weixin_39612228 (5 months ago)

    So multiple time steps may be needed to predict a single character, right?

    How do we know how many time steps are needed to predict each character, given that there are more time steps than characters?

    When predicting, how does the decoder know how to aggregate the probabilities from many time steps to produce each character that appears in the transcript?

  • weixin_39611208 (5 months ago)

    I recommend you read the following papers for background on this kind of architecture and the CTC loss function; most if not all are available on arXiv (a minimal greedy CTC decoding sketch also follows this thread):

    1. Graves & Jaitly, "Towards End-to-End Speech Recognition with Recurrent Neural Networks"
    2. Maas et al., "Lexicon-Free Conversational Speech Recognition with Neural Networks"
    3. Hannun et al., "First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs"

  • weixin_39612228 (5 months ago)

    Thanks!

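The frame arithmetic described in the first reply can be sketched as follows. The numbers used here (16 kHz sample rate, 20 ms window, 10 ms hop, and a single stride-2 convolution over time) are assumptions chosen to mirror that reply, not a guaranteed description of the repository's exact configuration:

```python
# Rough sketch: the output length is a deterministic function of the input
# length, fixed by the spectrogram parameters and the convolutional strides.
# All concrete numbers below are assumptions mirroring the reply above.

def spectrogram_frames(num_samples, sample_rate=16000,
                       window_size=0.02, window_stride=0.01):
    """Number of STFT frames produced for an utterance of num_samples samples."""
    win_length = int(sample_rate * window_size)    # 320 samples per window
    hop_length = int(sample_rate * window_stride)  # 160 samples per hop
    return 1 + (num_samples - win_length) // hop_length

def model_timesteps(num_frames, conv_time_stride=2):
    """Timesteps left after a convolutional front end with stride 2 in time."""
    return (num_frames + conv_time_stride - 1) // conv_time_stride

one_second = 16000                        # one second of 16 kHz audio
frames = spectrogram_frames(one_second)   # ~100 spectrogram frames
steps = model_timesteps(frames)           # ~50 timesteps for character predictions
print(frames, steps)
```

So two utterances with the same number of frames always yield the same number of output timesteps; the LSTM does not change that length during training.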

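On the decoding question: with CTC the network emits a probability distribution over characters plus a blank symbol at every timestep, and the decoder collapses that sequence by merging consecutive repeats and dropping blanks, so no fixed number of timesteps per character is needed. Below is a minimal sketch of greedy (best-path) CTC decoding; the label set and blank index are illustrative assumptions, not the repository's actual configuration, and the real decoder may additionally use beam search with a language model:

```python
import numpy as np

LABELS = ["_", " ", "a", "b", "c"]   # "_" is the CTC blank at index 0 (assumed)

def greedy_ctc_decode(probs, blank=0):
    """probs: (timesteps, num_labels) array of per-timestep probabilities."""
    best_path = np.argmax(probs, axis=1)   # most likely label at each timestep
    decoded, prev = [], None
    for idx in best_path:
        # Merge consecutive repeats and skip blanks; a blank between two
        # identical labels is what allows genuinely doubled characters.
        if idx != blank and idx != prev:
            decoded.append(LABELS[idx])
        prev = idx
    return "".join(decoded)

# Six timesteps that decode to "ab": the repeated "a" is merged and the
# blank-dominated timesteps are dropped.
probs = np.array([
    [0.1, 0.0, 0.8, 0.1, 0.0],   # a
    [0.1, 0.0, 0.8, 0.1, 0.0],   # a (repeat, merged)
    [0.9, 0.0, 0.0, 0.1, 0.0],   # blank
    [0.1, 0.0, 0.0, 0.9, 0.0],   # b
    [0.9, 0.0, 0.0, 0.1, 0.0],   # blank
    [0.9, 0.0, 0.0, 0.1, 0.0],   # blank
])
print(greedy_ctc_decode(probs))  # -> "ab"
```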