weixin_39620037
2021-01-08 14:35

Any plan for utterance batch decoding for the Transformer ASR model?

I noticed that the RNN-based model has an utterance batch decoding method implemented in recognize_batch; however, there is no such implementation for the Transformer-based ASR model.

I understand that the more advanced beam search interface has been moved to recog_v2, but in this new implementation I can only find the beam-batch interface batch_score, not an utterance batch decoding method. So, are there any plans for utterance batch decoding for the Transformer-based ASR model?
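For context, the difference between the two kinds of batching can be sketched with tensor shapes alone. The snippet below is a minimal illustration under my own naming, not the actual espnet API: beam batching scores the B partial hypotheses of a single utterance in one forward pass, while utterance batching would advance U utterances through the search together.

python
import torch

vocab, d, xlen, ylen = 5000, 256, 100, 7  # vocab size, model dim, encoder/prefix lengths
B, U = 10, 8                              # beam size, number of utterances

# Beam batch (what batch_score covers): the "batch" is the B partial
# hypotheses of ONE utterance; its encoder output is expanded B times.
ys_beam = torch.randint(vocab, (B, ylen))             # B hypothesis prefixes
enc_beam = torch.randn(1, xlen, d).expand(B, xlen, d)

# Utterance batch (what recognize_batch does for RNN models): U utterances
# advance through the search together, so with beams the decoder would
# score U * B hypotheses per step instead of B.
ys_utt = torch.randint(vocab, (U * B, ylen))
enc_utt = torch.randn(U, xlen, d).repeat_interleave(B, dim=0)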

This question comes from the open-source project: espnet/espnet


7 replies

  • weixin_39620037 4 months ago

    Maybe it's not a good idea to achieve this goal that way. During decoding, I observed a strange phenomenon: GPU utilization stayed at a high level (90%+) while the actual power draw was much lower, as shown in the screenshot below (Snipaste_2020-11-17_11-15-27).

    I don't know much about the GPU's working mechanism, but in my experience we will not get much performance gain in such a situation. However, merging the utterance batch into the beam batch may help; a sketch of what I mean follows below.
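    To make "merging the utterance batch into the beam batch" concrete, here is a minimal sketch of the idea; score_fn and the shapes are illustrative assumptions, not espnet's API. The hypotheses of U utterances with B beams each are flattened into a single (U * B)-row batch for one scorer call, then reshaped back so each utterance is pruned on its own.

    python
    import torch

    def merged_score(score_fn, ys, enc):
        """ys: (U, B, ylen) hypothesis prefixes; enc: (U, xlen, d) encoder outputs."""
        U, B, ylen = ys.shape
        # Flatten utterances and beams into one batch dimension ...
        ys_flat = ys.reshape(U * B, ylen)
        # ... and repeat each utterance's encoder output once per beam.
        enc_flat = enc.repeat_interleave(B, dim=0)   # (U * B, xlen, d)
        logp = score_fn(ys_flat, enc_flat)           # (U * B, vocab)
        # Reshape back so each utterance is pruned independently.
        return logp.view(U, B, -1)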

  • weixin_39927408 4 months ago

    Yeah, the suggested solution does not solve the GPU efficiency issue, but it is a useful extension that doesn't change our API.

    What is your purpose? Our current implementation corresponds to the S=1 result in Table 6 of https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2860.pdf. Yes, with utterance batching we could obtain roughly two times faster decoding, but a normal test set is much smaller than the training data and we don't need multiple epochs, so I don't have major issues with the current implementation.

    If the purpose is to perform decoding during training (e.g., MBR), it would be a problem. Likewise, if your application is to automatically transcribe a huge amount of speech data, utterance batching would bring great benefits. We also have such use cases, and I can raise the priority of developing utterance batch processing depending on your request.

  • weixin_39900286 4 months ago

    Utterance batch decoding is a very useful function for semi-supervised training with a large unlabeled speech corpus, especially for self-training or pseudo-labeling methods. So it would be nice to support it.

  • weixin_39927408 4 months ago

    Yes, this is another super important application. Thanks for your comment. We'll raise the priority of this development item.

  • weixin_39620037 4 months ago


    Thanks for the detailed reply. To be honest, my application is more on the engineering side: multiple audio utterances need to be processed at the same time, and it is time-sensitive. Using torch.multiprocessing may not be a good idea because GPU utilization has already reached its bottleneck, whereas utterance batching could (I guess) further improve the decoding speed.

    I think the comment above really describes a situation where utterance batch decoding can help a lot, and such situations are more common for research purposes.

    Thank you all for the helpful discussion; I will close this issue for now.

  • weixin_39927408 4 months ago

    This is a very good discussion. We decided to implement batch scoring only, without an utterance batch, since that keeps the API v2 implementation simple and the main ASR decoding scenario uses a single utterance. However, we also think the utterance batch is important for offline ASR scenarios and for decoding during training. We can raise the priority of this part upon request. (Added for this discussion.)

  • weixin_39579468 4 months ago

    I wonder if we can use the multiprocessing API, which requires no changes in recog_v2: https://pytorch.org/docs/stable/notes/multiprocessing.html

    python
    import torch
    import torch.multiprocessing as mp

    from model import MyModel          # placeholder model wrapper
    from utterance import utt_loader   # placeholder utterance loader

    def recog(rank, num_processes, model):
        result = []
        # Each process decodes a disjoint shard of the utterance list.
        for i, utt in enumerate(utt_loader('path to utt list')):
            if i % num_processes == rank:
                result.append(model.recognize(utt))
        torch.save(result, f'result.{rank}.pkl')

    if __name__ == '__main__':
        mp.set_start_method('spawn')   # required if the model holds CUDA tensors
        num_processes = 4
        model = MyModel('path to model')
        model.share_memory()           # share weights across processes (assumes an nn.Module)
        processes = []
        for rank in range(num_processes):
            p = mp.Process(target=recog, args=(rank, num_processes, model))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

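    A note on the design: each process above decodes a disjoint shard of the utterance list so no work is duplicated, and the spawn start method is used because CUDA tensors cannot be shared from fork-started processes. Still, as observed earlier in the thread, if one process already saturates the GPU, extra processes mainly contend for it; multiprocessing parallelizes the Python-side search loop rather than the GPU work itself, which is why merging utterances into the beam batch can still win.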
