Comment by dfajgljsldkjag

19 hours ago

It requires a bit of tinkering, but I think pipecat is the way to go. You can plug in pretty much any STT/LLM/TTS you want and go. It definitely supports local models, but it's up to you to get your hands on those models.
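To make the "plug in any STT/LLM/TTS" idea concrete, here is a rough sketch of a three-stage voice pipeline with swappable stages. This is not pipecat's actual API; `VoicePipeline`, `run_turn`, and the `fake_*` stages are hypothetical names made up for illustration, and the stand-in stages just pass text through as bytes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not pipecat's API).
STT = Callable[[bytes], str]   # audio in, transcript out
LLM = Callable[[str], str]     # prompt in, reply text out
TTS = Callable[[str], bytes]   # reply text in, audio out


@dataclass
class VoicePipeline:
    """Glues three independent stages together; any stage can be swapped."""
    stt: STT
    llm: LLM
    tts: TTS

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)   # speech -> text
        reply = self.llm(transcript)      # text -> text
        return self.tts(reply)            # text -> speech


# Stand-in stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.run_turn(b"hello")
```

Swapping a local model in would mean replacing one of the three callables and leaving the other two alone.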

Not sure if there are any turnkey setups preconfigured for local install where you can just press play and go, though.

Last I heard, E2E speech-to-speech models are still pretty weak. I've had pretty bad results from gpt-realtime, and that's a proprietary model, so I'm assuming open source is a bit behind.

I suspect the glued pipeline is going to remain dominant for a while, mostly because the intermediate text layer is structural, not just a byproduct. If you drop the text for a pure E2E model, you suddenly lose the ability to easily inject RAG context or handle complex tool use. I've been building some agent workflows recently and having that text state to pass into something like LangGraph is the only way to reliably control the logic. Without it, you are basically flying blind on the backend.
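A minimal sketch of why the text layer matters: once the transcript exists as a string, you can retrieve context and rewrite the prompt before the LLM ever sees it. Everything here is hypothetical (the `DOCS` toy knowledge base, `retrieve`, `build_prompt`, `handle_turn`), and the keyword lookup is only a stand-in for a real retriever.

```python
# Hypothetical toy knowledge base, standing in for a real document store.
DOCS = {
    "refund": "Refunds are processed within 5 business days.",
    "hours": "Support is open 9am to 5pm on weekdays.",
}


def retrieve(transcript: str) -> list[str]:
    """Naive keyword lookup standing in for a real retriever."""
    words = transcript.lower().split()
    return [text for key, text in DOCS.items() if key in words]


def build_prompt(transcript: str) -> str:
    """Inject retrieved context into the prompt at the text layer."""
    context = retrieve(transcript)
    if not context:
        return f"User: {transcript}"
    joined = "\n".join(context)
    return f"Context:\n{joined}\nUser: {transcript}"


def handle_turn(audio, stt, llm, tts):
    """One turn of a glued pipeline; stt, llm, and tts are any callables."""
    transcript = stt(audio)            # the intermediate text state
    prompt = build_prompt(transcript)  # the hook an E2E model does not expose
    return tts(llm(prompt))
```

The same hook is where tool calls or a graph-based controller would read and update state; with a pure audio-to-audio model there is no equivalent string to intercept.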

Yes, I am currently playing with pipecat, both with an ASR + LLM + TTS pipeline and with speech-to-text (ultravox) + TTS, but I haven't been successful with local speech-to-speech setups yet.