Comment by soulofmischief

16 hours ago

I have a great local assistant that works end-to-end with voice. It's built on local, web-first technologies; it fits small LLMs in memory and handles inference and TTS/STT without stuttering. I've been shaping it up over a couple of years, constantly swapping in new models.

If you want something simple that runs in the browser, look at vosk-browser[0] and vits-web[1].

I'd also recommend checking out KittenTTS[2]. I use it, and it's great for its size and performance. However, you'd need to implement a custom JavaScript harness for the model, since it's a Python project. If you need help with that, shoot me an email and I can share some code.

There are other great approaches too if you don't mind Python. Personally, I chose the web as a platform so that my agent will be fully portable and usable remotely once I release it.

And of course, NVIDIA's new model came out last week[3], though I haven't gotten to test it yet. There was also the recent Sparrow-1[4] announcement, which shows people are finally putting money into the problems that plague voice agents rigged up from several models and glue infrastructure, as opposed to a single end-to-end model, or at least a dedicated conversational turn-taking model to keep things on rails.
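
To make the "glue" problem concrete, here is a minimal TypeScript sketch of the kind of naive silence-timeout endpointing a rigged-up pipeline might use to decide when the user has finished talking. This is an illustrative assumption about typical glue code, not how Sparrow-1, NVIDIA's model, or any library linked here works; `detectEndOfTurn` and its parameters are hypothetical names.

```typescript
// Hypothetical helper for illustration; not part of any library mentioned here.
// Given per-frame audio energies, returns the index of the frame at which the
// user's turn is judged to have ended, or -1 if no end of turn is detected.
function detectEndOfTurn(
  frameEnergies: number[],
  speechThreshold: number,    // energy at or above this counts as speech
  silenceFramesToEnd: number, // consecutive quiet frames that end the turn
): number {
  let heardSpeech = false;
  let silentRun = 0;
  for (let i = 0; i < frameEnergies.length; i++) {
    if (frameEnergies[i] >= speechThreshold) {
      // Speech resets the silence counter.
      heardSpeech = true;
      silentRun = 0;
    } else if (heardSpeech) {
      // Only count silence after the user has started speaking.
      silentRun++;
      if (silentRun >= silenceFramesToEnd) return i;
    }
  }
  return -1;
}

// The speaker pauses for three frames mid-sentence, then resumes at frame 7.
const energies = [0, 0, 0.8, 0.9, 0.1, 0.1, 0.1, 0.7];
const shortTimeout = detectEndOfTurn(energies, 0.5, 3);
const longTimeout = detectEndOfTurn(energies, 0.5, 4);
```

With a three-frame timeout the turn is cut off at frame 6, just before the speaker resumes; a four-frame timeout survives the pause but makes every reply wait longer. That trade-off is the sort of thing a dedicated turn-taking model is meant to handle better than a fixed timer.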

[0] https://www.npmjs.com/package/vosk-browser

[1] https://github.com/diffusionstudio/vits-web

[2] https://github.com/KittenML/KittenTTS

[3] https://research.nvidia.com/labs/adlr/personaplex/

[4] https://www.tavus.io/post/sparrow-1-human-level-conversation...