Comment by andhuman

12 hours ago

I built this recently. I used NVIDIA Parakeet for STT, openWakeWord for wake word detection, Mistral's Ministral 14B as the LLM, and Pocket TTS for TTS. It fits snugly in my 16 GB of VRAM. Pocket is small and fast and has good-enough voice cloning. I first used the Chatterbox Turbo model, which performed better and even supported some simple paralinguistic tokens like (chuckle) that made it more fun, but it was just a bit too big for my rig.
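For anyone curious what the glue looks like, here's a minimal sketch of one wake-word-triggered turn. All the class and method names below are hypothetical stand-ins, not real library APIs; in a real build each stub would wrap the actual Parakeet, Ministral, and Pocket TTS calls.

```python
# Hedged sketch of the "STT -> LLM -> TTS" pipeline described above.
# Every class here is a placeholder stub, NOT a real library API.

from dataclasses import dataclass


@dataclass
class Turn:
    user_text: str
    reply_text: str


class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        # A real implementation would run a streaming ASR model (e.g. Parakeet).
        return audio.decode("utf-8", errors="ignore")


class StubLLM:
    def chat(self, prompt: str) -> str:
        # Placeholder for the local LLM call.
        return f"You said: {prompt}"


class StubTTS:
    def synthesize(self, text: str) -> bytes:
        # Placeholder for TTS synthesis; a real one would stream audio chunks.
        return text.encode("utf-8")


def handle_utterance(audio: bytes, stt: StubSTT, llm: StubLLM, tts: StubTTS) -> tuple[Turn, bytes]:
    """One wake-word-triggered turn: audio in, audio out."""
    text = stt.transcribe(audio)
    reply = llm.chat(text)
    speech = tts.synthesize(reply)
    return Turn(text, reply), speech


turn, speech = handle_utterance(b"what's the weather", StubSTT(), StubLLM(), StubTTS())
```

The point of the sketch is just the shape: three independent models chained through plain text, which is exactly the "glued together" pattern rather than a true end-to-end speech model.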

OP asked:

> Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?

Your setup is the latter, not the former.