普通网友 2018-02-06 16:31

Google Speech API + Go - transcribing an audio stream of unknown length

I have an RTMP stream of a video call and I want to transcribe it. I have created two services in Go and I'm getting results, but the transcription is not very accurate and a lot of data seems to get lost.

Let me explain.

I have a transcode service: I use ffmpeg to transcode the video to LINEAR16 audio and place the output bytes onto a Pub/Sub queue for a transcribe service to handle. Obviously there is a limit to the size of a Pub/Sub message, and I want to start transcribing before the end of the video call, so I chunk the transcoded data into roughly 3-second clips (not a fixed length, it just seems about right) and put them onto the queue.

The data is transcoded quite simply:

var stdout bytes.Buffer

// Transcode the stream to 16 kHz mono LINEAR16 (s16le) PCM on stdout.
cmd := exec.Command("ffmpeg", "-i", url, "-f", "s16le", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", "-")
cmd.Stdout = &stdout

if err := cmd.Start(); err != nil {
    log.Fatal(err)
}

ticker := time.NewTicker(3 * time.Second)

for range ticker.C {
    bytesConverted := stdout.Len()
    log.Infof("Converted %d bytes", bytesConverted)

    // Send the data we converted, even if there are no bytes.
    topic.Publish(ctx, &pubsub.Message{
        Data: stdout.Bytes(),
    })

    stdout.Reset()
}

The transcribe service pulls messages from the queue at a rate of one every 3 seconds, to process the audio data at about the same rate as it's being created. The Speech API puts a limit on streaming recognition: a stream can't be longer than 60 seconds, so I stop the old stream and start a new one every 30 seconds, and we never hit the limit no matter how long the video call lasts.

This is how I'm transcribing it:

stream := prepareNewStream()
clipLengthTicker := time.NewTicker(30 * time.Second)
chunkLengthTicker := time.NewTicker(3 * time.Second)

cctx, cancel := context.WithCancel(context.TODO())
err := subscription.Receive(cctx, func(ctx context.Context, msg *pubsub.Message) {

    select {
    case <-clipLengthTicker.C:
        log.Infof("Clip length reached.")
        log.Infof("Closing stream and starting over")

        err := stream.CloseSend()
        if err != nil {
            log.Fatalf("Could not close stream: %v", err)
        }

        go getResult(stream)
        stream = prepareNewStream()

    case <-chunkLengthTicker.C:
        log.Infof("Chunk length reached.")

        bytesConverted := len(msg.Data)

        log.Infof("Received %d bytes
", bytesConverted)

        if bytesConverted > 0 {
            if err := stream.Send(&speechpb.StreamingRecognizeRequest{
                StreamingRequest: &speechpb.StreamingRecognizeRequest_AudioContent{
                    AudioContent: msg.Data,
                },
            }); err != nil {
                resp, _ := stream.Recv()
                log.Errorf("Could not send audio: %v", resp.GetError())
            }
        }

        msg.Ack()
    }
})

I think the problem is that my 3-second chunks don't necessarily line up with the starts and ends of phrases or sentences. I suspect the Speech API model has been trained on full sentences rather than individual words, so a clip that starts in the middle of a sentence loses some data because the model can't make sense of the first few words up to the natural end of a phrase. I also lose some data when switching from an old stream to a new one: some context is lost. I guess overlapping clips might help with this (sketched below).
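For illustration, here is a minimal sketch of the overlapping-clips idea on the publishing side, assuming chunks arrive as []byte slices of LINEAR16 audio. The overlap length (1 second here) and the publishWithOverlap helper are guesses of mine, not part of my actual services; log.Printf stands in for topic.Publish:

package main

import "log"

// 1 second of 16 kHz mono s16le audio; the right overlap length is a guess.
const overlapBytes = 16000 * 2

var tail []byte // holds the last overlapBytes of the previous chunk

// publishWithOverlap prepends the previous chunk's tail to the new chunk,
// so consecutive clips share about a second of audio.
func publishWithOverlap(chunk []byte) {
    padded := append(append([]byte{}, tail...), chunk...)
    log.Printf("publishing %d bytes", len(padded)) // stands in for topic.Publish

    keep := overlapBytes
    if len(chunk) < keep {
        keep = len(chunk)
    }
    tail = append(tail[:0], chunk[len(chunk)-keep:]...)
}

func main() {
    publishWithOverlap(make([]byte, 96000)) // a 3-second chunk
    publishWithOverlap(make([]byte, 96000))
}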

I have a couple of questions:

1) Does this architecture seem appropriate for my constraints (unknown length of audio stream, etc.)?

2) What can I do to improve accuracy and minimise lost data?

(Note I've simplified the examples for readability. Point out if anything doesn't make sense because I've been heavy handed in cutting the examples down.)


1 answer

  • douba1067 2018-02-14 18:18

    I think you are right that splitting the audio into chunks causes many words to be chopped off.

    I see another problem in the publishing: between the calls to topic.Publish and stdout.Reset() some time will pass, and ffmpeg will probably have written some unpublished bytes to stdout, which get cleared by the reset.
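    One way to avoid this race is to read fixed-size chunks straight from ffmpeg's stdout pipe instead of sharing a bytes.Buffer between ffmpeg and the ticker goroutine (bytes.Buffer is not safe for concurrent use anyway). A minimal sketch, assuming the same ffmpeg flags as in your question; the URL and the publish helper are placeholders:

    package main

    import (
        "io"
        "log"
        "os/exec"
    )

    func main() {
        // Same flags as the question; the URL is a placeholder.
        cmd := exec.Command("ffmpeg", "-i", "rtmp://example/stream",
            "-f", "s16le", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", "-")

        stdout, err := cmd.StdoutPipe()
        if err != nil {
            log.Fatal(err)
        }
        if err := cmd.Start(); err != nil {
            log.Fatal(err)
        }

        // 3 seconds of 16 kHz mono 16-bit PCM = 16000 * 2 * 3 bytes.
        const chunkSize = 16000 * 2 * 3
        buf := make([]byte, chunkSize)
        for {
            // ReadFull blocks until buf is full (or the stream ends), so
            // every byte ffmpeg writes lands in exactly one chunk.
            n, err := io.ReadFull(stdout, buf)
            if n > 0 {
                chunk := make([]byte, n)
                copy(chunk, buf[:n]) // copy because buf is reused next iteration
                publish(chunk)
            }
            if err != nil { // io.EOF or io.ErrUnexpectedEOF when ffmpeg exits
                break
            }
        }
    }

    // publish stands in for the question's topic.Publish call.
    func publish(data []byte) {
        log.Printf("publishing %d bytes", len(data))
    }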

    I am afraid the architecture is not a good fit for your problem. The constraint on message size causes many problems. The idea of a Pub/Sub system is that a publisher notifies subscribers of events, not necessarily that it carries large payloads.

    Do you really need two services? You could use two goroutines that communicate via a channel, which would eliminate the Pub/Sub system, as in the sketch below.
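    A minimal sketch of that suggestion, assuming both services run in one process; transcoding and transcription are reduced to placeholders here:

    package main

    import "log"

    func main() {
        // The buffered channel replaces the Pub/Sub topic; channels have no
        // message-size limit and no publish/ack machinery.
        chunks := make(chan []byte, 16)

        // Producer goroutine: stands in for the transcode service.
        go func() {
            defer close(chunks) // tells the consumer the stream has ended
            for i := 0; i < 3; i++ {
                chunks <- make([]byte, 96000) // e.g. 3 s of LINEAR16 from ffmpeg
            }
        }()

        // Consumer: stands in for the transcribe service.
        for chunk := range chunks {
            log.Printf("transcribing %d bytes", len(chunk)) // e.g. stream.Send(...)
        }
    }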

    A strategy would be to make the chunks as large as possible. A possible solution (sketched after this list):

    • Make the chunks as large as possible (nearly 60 seconds)
    • Make the chunks overlap each other by a short time (e.g. 5 seconds)
    • Programmatically detect the overlaps and remove them
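
    For the last step, here is a minimal sketch of one way to remove the overlap on the transcript side, assuming each clip's transcript comes back as a plain string. mergeTranscripts is a hypothetical helper that drops the words duplicated by the audio overlap; real Speech API results would need word-level handling:

    package main

    import (
        "fmt"
        "strings"
    )

    // mergeTranscripts finds the longest word-wise suffix of prev that is a
    // prefix of next and drops it from next before concatenating.
    func mergeTranscripts(prev, next string) string {
        pw := strings.Fields(prev)
        nw := strings.Fields(next)

        // Try the longest possible overlap first.
        max := len(pw)
        if len(nw) < max {
            max = len(nw)
        }
        for k := max; k > 0; k-- {
            if equalWords(pw[len(pw)-k:], nw[:k]) {
                nw = nw[k:] // drop the duplicated words from the new clip
                break
            }
        }
        return strings.TrimSpace(prev + " " + strings.Join(nw, " "))
    }

    // equalWords compares two equal-length word slices case-insensitively.
    func equalWords(a, b []string) bool {
        for i := range a {
            if !strings.EqualFold(a[i], b[i]) {
                return false
            }
        }
        return true
    }

    func main() {
        prev := "so I think we should ship it on Friday"
        next := "ship it on Friday after the final review"
        fmt.Println(mergeTranscripts(prev, next))
        // Output: so I think we should ship it on Friday after the final review
    }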

