
Comment by luke-stanley

1 year ago

A cross-platform browser VAD module is available at https://github.com/ricky0123/vad; it's an ONNX port of Silero's VAD network. By cross-platform, I mean it works in Firefox too. It doesn't need a WebRTC session to work, just microphone access, so it's simpler. I'm curious about the browser providing this as a native option too.
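For reference, a minimal usage sketch based on the project's documented API at the time (the package name and callback options may have shifted in newer versions):

```typescript
// Minimal sketch of running the ricky0123/vad ONNX Silero port in the browser.
// Package and option names follow the project's README and may differ by version.
import { MicVAD } from "@ricky0123/vad-web";

async function startVad() {
  const vad = await MicVAD.new({
    onSpeechStart: () => console.log("speech started"),
    // Called with the captured utterance (Float32Array of PCM samples) when speech ends.
    onSpeechEnd: (audio: Float32Array) => {
      console.log(`speech ended: ${audio.length} samples`);
    },
  });
  vad.start(); // prompts for microphone access and starts running the model
}

startVad();
```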

There are in-browser text-to-speech engines too, and they're starting to get faster and higher quality. It would be great if browsers shipped with great TTS.
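Worth noting that browsers already expose a native (if uneven) TTS interface via the Web Speech API; quality depends on the voices the OS and browser ship. A minimal sketch:

```typescript
// Native browser TTS via the Web Speech API (speechSynthesis).
const utterance = new SpeechSynthesisUtterance("Hello from the browser's built-in TTS.");
utterance.rate = 1.0;  // speaking rate
utterance.pitch = 1.0; // voice pitch

// Voices load asynchronously in some browsers; pick the first one if any are ready.
const voices = window.speechSynthesis.getVoices();
if (voices.length > 0) {
  utterance.voice = voices[0];
}

window.speechSynthesis.speak(utterance);
```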

GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.

I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!

I do wonder how necessary or optimal a single combined model really is for optimising latency and cost.

The breakdown provided is interesting.

I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?

>> I'm curious about the browser providing this as a native option too.

IMHO the desktop environment should provide voice-to-text as a service with a standard interface to applications - like stdin or similar, but distinct for voice. Apps would ignore it by default since they aren't listening, but the transcriber could be swapped out and would be available to all apps.
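A hypothetical sketch of what consuming such a service could look like from an app's side. Nothing here is a real interface: the socket path and the newline-delimited JSON message shape are invented purely to illustrate the "stdin-like, but distinct for voice" idea:

```typescript
// Hypothetical consumer of a desktop voice-to-text service over a Unix socket.
// The socket path and message format are invented for illustration only.
import * as net from "node:net";

const VOICE_SOCKET = "/run/user/1000/voice-transcripts.sock"; // assumed path

const client = net.createConnection(VOICE_SOCKET, () => {
  console.log("connected; apps that never connect simply ignore voice");
});

let buffer = "";
client.on("data", (chunk) => {
  buffer += chunk.toString("utf8");
  // Assume one JSON object per line, e.g. {"text": "open the editor", "final": true}
  let newline: number;
  while ((newline = buffer.indexOf("\n")) !== -1) {
    const line = buffer.slice(0, newline);
    buffer = buffer.slice(newline + 1);
    const event = JSON.parse(line) as { text: string; final: boolean };
    if (event.final) {
      console.log("utterance:", event.text);
    }
  }
});
```

The transcriber behind the socket is then a swappable implementation detail, which is the point.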

If you do STT and TTS on the device but everything else remains the same, these numbers say that saves you 120ms. The remaining 639ms is hardware and network latency, plus shuffling data into and out of the LLM. That's still slower than you want.
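Spelling out the arithmetic those two figures imply (the 200ms target is the rough conversational budget argued for below):

```typescript
// Latency budget implied by the figures quoted above (illustrative only).
const onDeviceSavingsMs = 120;  // STT + TTS moved on-device
const remainingMs = 639;        // hardware, network, and LLM data shuffling
const impliedTotalMs = onDeviceSavingsMs + remainingMs; // ~759ms before the change

const conversationalTargetMs = 200; // rough end-to-end budget for "instant" replies
console.log({ impliedTotalMs, remainingMs, gapMs: remainingMs - conversationalTargetMs }); // gap ≈ 439ms
```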

Logically, where you need to be is thinking in phonemes: you want the output of the LLM to have caught up with the last phoneme quickly enough that it can respond "instantly" when the endpoint is detected, which means the whole chain needs roughly 200ms of latency end-to-end. I suspect the only way to get anywhere close to that is with a different architecture that works somewhat more like human speech processing: it front-runs the audio stream by basing its output on phonemes predicted before they arrive, and uses the actual received audio only as a lightweight confirmation signal to decide whether to flush the current output buffer or to reprocess. You can get part-way there with speculative decoding, but I don't think you can do it with a mixed audio/text pipeline. Much better never to have to convert from audio to text and back again.
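For what it's worth, here is a rough sketch of that control loop. Every function below is a hypothetical stand-in for a model call; this is only meant to show where the speculation and the confirm-or-flush decision would sit, not how the models themselves would work:

```typescript
// Conceptual sketch of "front-running" the audio stream: generate a reply from
// predicted phonemes, then use the real audio only to confirm or discard that work.
type Phoneme = string;

// Hypothetical model calls (not real APIs):
declare function recognizePhonemes(frame: Float32Array): Phoneme[];            // incremental acoustic front-end
declare function predictNextPhonemes(heard: Phoneme[], n: number): Phoneme[];  // cheap phoneme predictor
declare function generateReplySpeculatively(context: Phoneme[]): Float32Array; // speech-in, speech-out model
declare function endpointDetected(heard: Phoneme[]): boolean;                  // VAD / endpointing

const heard: Phoneme[] = [];
let speculatedOn: Phoneme[] = [];
let speculativeReply: Float32Array | null = null;

function onAudioFrame(frame: Float32Array, play: (audio: Float32Array) => void) {
  const actual = recognizePhonemes(frame);

  // Lightweight confirmation: did the audio match the phonemes we front-ran on?
  const confirmed =
    speculativeReply !== null &&
    actual.every((p, i) => speculatedOn[heard.length + i] === p);

  heard.push(...actual);

  if (!confirmed) {
    // Misprediction: discard the buffered output and re-speculate past the audio.
    speculatedOn = [...heard, ...predictNextPhonemes(heard, 5)];
    speculativeReply = generateReplySpeculatively(speculatedOn);
  }

  if (endpointDetected(heard) && speculativeReply !== null) {
    // The reply was computed while the user was still speaking, so it can be
    // flushed the moment the endpoint fires.
    play(speculativeReply);
    speculativeReply = null;
  }
}
```

Like speculative decoding, the cheap predictor runs ahead and the expensive work is only kept when the prediction is confirmed; the difference is that nothing in the loop ever has to round-trip through text.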