Comment by ripped_britches
16 hours ago
Speech-to-speech is not nearly as good as LiveKit IMO (the "old school" sequence of transcribe, LLM, synthesize). It depends on what you're doing, of course, but this is just because the LLMs are way smarter than the speech-to-speech models, which are pretty much the worst (again, IMO) at anything beyond basic banter. And LiveKit is just a framework, so you can hook it up with any models in the stack. I'm not an expert on the local parts, but I would assume this is pretty easy to glue together.
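For anyone unfamiliar with the cascade being described, here is a minimal sketch of the transcribe -> LLM -> synthesize flow. The STT/LLM/TTS classes and handle_utterance below are hypothetical placeholders standing in for whichever models you plug into a framework like LiveKit; they are not its actual API, just an illustration of the pipeline shape and where the latency adds up.

```python
# Sketch of the "old school" cascaded voice pipeline
# (transcribe -> LLM -> synthesize). The stage classes are hypothetical
# placeholders, not a real LiveKit API; a framework like LiveKit mainly
# handles audio transport and lets you swap in any model at each stage.
import asyncio


class STT:
    """Placeholder speech-to-text stage (e.g. a local Whisper model)."""
    async def transcribe(self, audio_chunk: bytes) -> str:
        await asyncio.sleep(0.05)          # stand-in for inference latency
        return "turn off the kitchen lights"


class LLM:
    """Placeholder language-model stage; this is where the 'smarts' live."""
    async def respond(self, text: str) -> str:
        await asyncio.sleep(0.2)           # stand-in for inference latency
        return f"Okay, turning off the kitchen lights. (heard: {text})"


class TTS:
    """Placeholder text-to-speech stage."""
    async def synthesize(self, text: str) -> bytes:
        await asyncio.sleep(0.1)           # stand-in for inference latency
        return text.encode()               # stand-in for audio samples


async def handle_utterance(audio_chunk: bytes) -> bytes:
    """One turn of the cascade. End-to-end latency is roughly the sum of
    the three stages, which is why streaming each stage matters if you
    want an Alexa-style assistant."""
    stt, llm, tts = STT(), LLM(), TTS()
    transcript = await stt.transcribe(audio_chunk)
    reply_text = await llm.respond(transcript)
    return await tts.synthesize(reply_text)


if __name__ == "__main__":
    audio_out = asyncio.run(handle_utterance(b"\x00" * 320))
    print(audio_out.decode())
```

The point of the sketch is that each stage is independently swappable, and the turn latency is the sum of the stages, which is what the latency discussion below is about.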
They work for two entirely different things. The problem with these pipelines is that unless the latency is very low, they simply aren't suitable replacements for Alexa etc. For that use case, low latency beats smarts.
The latency is very, very low in my experience; it would definitely work well as an Alexa-style assistant.