weixin_39535349
2020-12-01 23:17

nnet3-online-recognizer with endpointing fails to decode beginning of new utterance

Hey,
Thanks for updating the offset value in the endpointing decoding method in nnet3-online-recognizer.py.

It works a lot better, but I still run into trouble getting the beginning of a new utterance when we call asr.finalize_decoding(). If you look at line 124 of the script, feat_pipeline.accept_waveform(wav.samp_freq, remainder), it seems that no frames are actually added to the new feature pipeline, so they are not decoded when asr.advance_decoding() is called.

Since we finalize decoding each time we detect a silence, the current chunk (and possibly a few previous ones) contains no speech as long as chunk_size is short enough, i.e. not longer than the silence length. Wouldn't it then be better to keep those frames in the remainder? Something like:

remainder = data[i - a*chunk_size : i + chunk_size], where a is the number of past chunks to add to the new feature pipeline.

This way, we would not decode the same part of speech twice, and we would give some additional context for the beginning of the next utterance.
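
For illustration, here is a minimal sketch of that idea, reusing the names from the snippet above (data, i, chunk_size, feat_pipeline, wav); the value of a and the clamping of the start index are my own assumptions, not part of the current script:

    # Sketch: carry `a` past chunks plus the current one over to the new
    # feature pipeline after an endpoint, so the next utterance gets some
    # acoustic left context.
    a = 2  # assumed number of past chunks to keep

    start = max(0, i - a * chunk_size)  # don't slice before the start of data
    remainder = data[start : i + chunk_size]
    feat_pipeline.accept_waveform(wav.samp_freq, remainder)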

This question comes from the open source project: pykaldi/pykaldi

7 replies

  • weixin_39620037 5 months ago

    Great! If you guys have any comments about these APIs, please let us know by commenting on the kaldi PRs. Those PRs are very much in progress, so we can still make changes to the design to accommodate additional needs.

  • weixin_39620037 5 months ago

    Hey. nnet3-online-recognizer.py is an example script. We added it to the repo to help people unfamiliar with kaldi and pykaldi. The intent was to demonstrate how the various parts of pykaldi fit together in an online speech recognition scenario. If the logic of the code does not work for your case, you can change the script as you wish. That is exactly why we added these example scripts: so people can customize them to match their needs 😄.

    Regarding your questions, I don't think I quite follow what you mean by

    I still run into trouble getting the beginning of a new utterance when we call asr.finalize_decoding().

    What is the trouble you are running into?

    Also I don't quite follow this part

    It seems that no frames are actually added to the new feature pipeline, so they are not decoded when asr.advance_decoding() is called.

    Are you saying that the remainder is empty? The remainder should typically include a small number of samples and adding those samples to the new pipeline might not change the number of frames. We are adding the remainder here for correctness. I don't think it will actually affect decoding results in an appreciable way.

    Wouldn't it then be better to keep those frames in the remainder? Something like: remainder = data[i - a*chunk_size : i + chunk_size], where a is the number of past chunks to add to the new feature pipeline. This way, we would not decode the same part of speech twice, and we would give some additional context for the beginning of the next utterance.

    This is certainly a reasonable thing to do if you would like to keep some extra past context. Note that immediate past context will not always be silence. Endpointing can trigger in other conditions.

  • weixin_39535349 5 months ago

    Thanks for your answer!

    Regarding the trouble getting the beginning of the utterance: I mean that each time an endpoint is detected, we call asr.finalize_decoding() and create a new feature pipeline for the next utterance. At that point, a few frames seem to be lost and some speech is not decoded; the beginning of the transcription for this new utterance is missing. Is this something you're aware of?

    To me, this was caused by the remainder being too small. It is not empty, but after adding the remainder data to the feature pipeline at line 124, num_frames_ready() is still zero, whereas it should equal the number of remaining frames in the chunk. Since you say you added this remainder only for correctness, maybe you're aware of this, but then it sounds useless to me, or I might be missing something.

    This is certainly a reasonable thing to do if you would like to keep some extra past context. Note that immediate past context will not always be silence. Endpointing can trigger in other conditions.

    You're right, but the only case where endpointing triggers on something other than silence is when the utterance is longer than 20 seconds, which should not happen very often (at least in the situations I handle).

  • weixin_39620037 5 months ago

    Regarding the trouble getting the beginning of the utterance: I mean that each time an endpoint is detected, we call asr.finalize_decoding() and create a new feature pipeline for the next utterance. At that point, a few frames seem to be lost and some speech is not decoded; the beginning of the transcription for this new utterance is missing. Is this something you're aware of?

    I am not aware of this. This might be due to the additional left feature context used by the acoustic model. I am pretty sure the current code is not dropping any frames, but it is also not keeping the left feature context, which would be the last few frames of the previous utterance. Each utterance is decoded as if there is no left context. If your acoustic models are trained on data where all utterances are padded with silence at the beginning and end, the acoustic model might expect to see these silence frames in test utterances. Your approach of adding past context just before the endpoint would indeed help with such an issue.

    To me, this was caused by the remainder being too small. It is not empty, but after adding the remainder data to the feature pipeline at line 124, num_frames_ready() is still zero, whereas it should equal the number of remaining frames in the chunk. Since you say you added this remainder only for correctness, maybe you're aware of this, but then it sounds useless to me, or I might be missing something.

    It is not surprising that num_frames_ready() is still zero after adding the remainder. Under typical feature extraction settings, the remainder will contain only a small number of samples, which may not amount to even a single feature frame. As I said, I added the remainder for correctness; I don't expect it to have an appreciable effect on decoding under typical conditions. Again, in your scenario, it makes sense to keep more samples as the remainder so the acoustic model will have access to enough context for the next utterance.
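
    As a rough back-of-the-envelope illustration (the numbers are assumptions, not values taken from the script: 16 kHz audio with Kaldi's usual 25 ms frame length and 10 ms frame shift), a remainder shorter than one frame length yields zero feature frames:

    # Assumed settings, just to show why a tiny remainder yields no frames.
    samp_freq = 16000
    frame_length = int(0.025 * samp_freq)  # 400 samples per frame
    frame_shift = int(0.010 * samp_freq)   # 160 samples per shift

    def num_whole_frames(num_samples):
        # Number of complete feature frames that fit in num_samples.
        if num_samples < frame_length:
            return 0
        return 1 + (num_samples - frame_length) // frame_shift

    print(num_whole_frames(300))   # 0 -> a small remainder gives no frames
    print(num_whole_frames(1600))  # 8 -> 100 ms of audio gives several frames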

  • weixin_39612877 5 months ago

    I have the same issue (some incorrect results occurred at the beginning of a new utterance), so I changed the source code like this. Please check the source code below.

    # Keep a running count of frames decoded across all utterances so far.
    total_num_frames_decoded = 0
    ...

    # Convert the total number of decoded (output) frames back into a sample
    # offset: frames * subsampling factor * frame shift (s) * sample rate.
    total_num_frames_decoded += asr.decoder.num_frames_decoded()
    t1 = total_num_frames_decoded * asr.decodable_opts.frame_subsampling_factor
    t2 = feat_pipeline.frame_shift_in_seconds()
    offset = int(t1 * t2 * wav.samp_freq)
    ...

    # Carry over every sample that has not been decoded yet, so the new
    # feature pipeline starts exactly where decoding stopped.
    remainder = data[offset:i + chunk_size]
    feat_pipeline.accept_waveform(wav.samp_freq, remainder)
    
  • weixin_39620037 5 months ago

    I guess this makes sense. I am not sure why we had feat_pipeline.num_frames_ready() instead of asr.decoder.num_frames_decoded() * asr.decodable_opts.frame_subsampling_factor in the offset computation. We are also discussing this same issue in the kaldi pull requests https://github.com/kaldi-asr/kaldi/pull/2938 and https://github.com/kaldi-asr/kaldi/pull/3008. Once we improve the API in kaldi, handling this in pykaldi will be simpler too.
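
    As a concrete illustration of the unit conversion in this offset computation (all numbers below are assumed for the example, including the frame_subsampling_factor of 3 that is common for chain models):

    # Hypothetical values, only to show how decoded frames map back to samples.
    samp_freq = 16000                # samples per second (assumed)
    frame_shift_in_seconds = 0.01    # 10 ms frame shift (assumed)
    frame_subsampling_factor = 3     # assumed
    num_frames_decoded = 100         # frames reported by the decoder (assumed)

    # decoded frames -> input frames -> seconds -> samples
    offset = int(num_frames_decoded * frame_subsampling_factor
                 * frame_shift_in_seconds * samp_freq)
    print(offset)  # 48000, i.e. the first 3 seconds of audio are already decoded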

  • weixin_39535349 5 months ago

    I was looking at the two kaldi PRs you mentioned, where you're addressing the same issue. You highlighted the problem of the missing left context and its consequences on decoding:

    Whenever we reset the feature pipeline, we lose the left frame context. Unlike the first issue, this has an effect on decoding results since acoustic models typically make use of left context.

    That's why I didn't understand why you said that 'each utterance is decoded as if there is no left context' in your previous answer.

    Anyway, the solution above works fine, but I agree the improvements you're making in the kaldi PRs will make our problem easier to handle.
