Comment by regularfry
1 year ago
If you do STT and TTS on the device but everything else remains the same, according to these numbers that saves you 120ms. The remaining 639ms is hardware and network latency, and shuffling data into and out of the LLM. That's still slower than you want.
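As a rough back-of-the-envelope check of that budget (using only the numbers quoted above, plus the ~200ms conversational target mentioned below; nothing here is measured), the pipeline stays well over budget even if STT and TTS become free:

```python
# Rough latency budget from the figures quoted above (not measured here).
on_device_stt_tts_savings_ms = 120   # what moving STT + TTS on-device buys you
remaining_ms = 639                   # hardware, network, and LLM data shuffling
total_round_trip_ms = on_device_stt_tts_savings_ms + remaining_ms  # ~759ms today

conversational_target_ms = 200       # rough end-to-end target for an "instant" reply

print(f"pipeline today: ~{total_round_trip_ms}ms")
print(f"still over budget by ~{remaining_ms - conversational_target_ms}ms "
      f"even with free STT/TTS")
```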
Logically where you need to be is thinking in phonemes: you want the LLM's output to have caught up with the last phoneme quickly enough that it can respond "instantly" when the endpoint is detected, which means the whole chain needs roughly 200ms of latency end-to-end. I suspect the only way to get anywhere close to that is a different architecture, one that works somewhat more like human speech processing: it front-runs the audio stream by basing its output on phonemes predicted before they arrive, and uses the actual received audio only as a lightweight confirmation signal to decide whether to flush the current output buffer or to reprocess. You can get part-way there with speculative decoding, but I don't think you can do it with a mixed audio/text pipeline. Much better never to have to convert from audio to text and back again.
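A minimal sketch of that front-running loop, purely illustrative: every function name, threshold, and the placeholder phoneme predictor below are hypothetical stand-ins, not any real speech stack. The point is only the control flow: confirmed phonemes take the cheap path and leave the pre-computed reply warm; a mismatch flushes the speculation and reprocesses from the real audio.

```python
# Hypothetical sketch of speculative, phoneme-level front-running.
# predict_next_phonemes() and prepare_reply() are placeholder stubs.
from collections import deque

def predict_next_phonemes(history, n=5):
    """Stand-in for a phoneme-level LM: guess the next n phonemes."""
    return ["<guess>"] * n  # placeholder prediction

def prepare_reply(phonemes):
    """Stand-in for the LLM/TTS stage: pre-compute a reply for the speculated input."""
    return f"reply-for-{len(phonemes)}-phonemes"

def speculative_loop(incoming_phonemes, endpoint_marker="<eos>"):
    history = []            # phonemes confirmed from the real audio so far
    speculation = deque()   # phonemes guessed ahead of their arrival
    reply_buffer = None     # reply pre-computed against the speculated input

    for actual in incoming_phonemes:
        if speculation and speculation[0] == actual:
            # Prediction confirmed: cheap path, keep the pre-computed reply warm.
            speculation.popleft()
            history.append(actual)
        else:
            # Mismatch: flush the speculation and reprocess from the real audio.
            speculation.clear()
            history.append(actual)
            speculation.extend(predict_next_phonemes(history))
            reply_buffer = prepare_reply(history + list(speculation))

        if actual == endpoint_marker:
            # Endpoint detected: ideally the reply is already sitting in the buffer.
            return reply_buffer or prepare_reply(history)

    return reply_buffer

if __name__ == "__main__":
    print(speculative_loop(["h", "ə", "l", "oʊ", "<eos>"]))
```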