2021-01-08 11:45

Support CPU/GPU utterance batch decoding for the Transformer ASR model

Similar to e2e_asr, this is a simple implementation of the recognize_batch function in e2e_asr_transformer, which can serve as a temporary solution for Transformer GPU utterance batch decoding until it is officially supported in API v2. It has been tested on WSJ.

During debugging, we noticed that the way masks are subsampled in Conv2dSubsampling leads to inconsistencies between single-utterance and batched encoder outputs due to padding (which might also slightly affect training).

For example, the forward function currently returns x_mask[:, :, :-2:2][:, :, :-2:2]. Consider a 5-frame utterance batched together with a longer 7-frame one, so its mask is [[True, True, True, True, True, False, False], [True, True, True, True, True, True, True]]. After one subsampling step this becomes [[True, True, True], [True, True, True]], i.e. 3 True entries for the 5-frame utterance. This is inconsistent with feeding the same 5-frame utterance alone, where [True, True, True, True, True] is subsampled to [True, True], only 2 True entries (as illustrated in the figure below).
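The inconsistency can be reproduced with plain array slicing. This is a minimal sketch using 2-D numpy arrays in place of ESPnet's 3-D masks (the batch and padding values mirror the example above):

```python
import numpy as np

# Mask for a 5-frame utterance padded to 7 frames, batched with a 7-frame one.
batched = np.array([
    [True, True, True, True, True, False, False],  # 5 real frames + 2 padding
    [True, True, True, True, True, True, True],    # 7 real frames
])
# The same 5-frame utterance decoded on its own.
single = np.array([[True, True, True, True, True]])

# One subsampling step with the original slicing, x_mask[..., :-2:2]:
# the slice drops the *last* two positions, which are padding in the
# batched case, so the padded utterance keeps an extra True.
print(batched[:, :-2:2].sum(axis=1))  # [3 3] -> 3 True for the 5-frame utterance
print(single[:, :-2:2].sum(axis=1))   # [2]  -> only 2 True when decoded alone
```

The 5-frame utterance ends up with a different subsampled length depending on how it was batched, which is what perturbs the encoder output.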

Therefore, we change the mask subsampling to the form x_mask[:, :, 2::2][:, :, 2::2], so that the subsampled mask is no longer affected by padding at the end of the sequence.

