Comment by dfajgljsldkjag

16 days ago

It requires a bit of tinkering, but I think pipecat is the way to go. You can plug in pretty much any STT/LLM/TTS you want and go. It definitely supports local models, but it's up to you to get your hands on those models.
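
To make the "plug in any STT/LLM/TTS" idea concrete, here is a minimal sketch of the shape of such a pipeline. This is not pipecat's API: the `STT`, `LLM`, `TTS`, and `VoicePipeline` names are hypothetical, and the three stand-in stages only fake the work so the example runs without any models installed.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical stage interfaces (assumptions for illustration, not pipecat's API).
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Chains the three stages; any object with the right method can be swapped in."""
    stt: STT
    llm: LLM
    tts: TTS

    def run(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # audio -> text
        reply = self.llm.complete(text)     # text -> text
        return self.tts.synthesize(reply)   # text -> audio


# Stand-in stages so the sketch runs without real models.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio bytes are the transcript


class UpperLLM:
    def complete(self, prompt: str) -> str:
        return prompt.upper()  # pretend "reasoning"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretend the bytes are synthesized audio


pipeline = VoicePipeline(stt=EchoSTT(), llm=UpperLLM(), tts=BytesTTS())
result = pipeline.run(b"hello")
```

Swapping a cloud model for a local one would then mean replacing one stage object and leaving the rest alone, which is roughly the appeal of this kind of framework.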

Not sure if there are any turnkey setups preconfigured for local install where you can just press play and go, though.

Last I heard, E2E speech-to-speech models are still pretty weak. I've had pretty bad results from gpt-realtime, and that's a proprietary model, so I'm assuming open source is a bit behind.

3 comments


storystarling 16 days ago

I suspect the glued pipeline is going to remain dominant for a while, mostly because the intermediate text layer is structural, not just a byproduct. If you drop the text for a pure E2E model, you suddenly lose the ability to easily inject RAG context or handle complex tool use. I've been building some agent workflows recently and having that text state to pass into something like LangGraph is the only way to reliably control the logic. Without it, you are basically flying blind on the backend.
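
A rough sketch of what that text layer buys you: once the STT stage has produced a transcript, retrieved context can be injected before the LLM ever sees the turn. Everything here is an assumption for illustration: `retrieve`, `build_prompt`, and `handle_turn` are hypothetical helpers, and the keyword lookup is a toy stand-in for a real retriever.

```python
from typing import Callable


def retrieve(query: str, docs: dict[str, str]) -> list[str]:
    """Toy retrieval: return every doc whose key appears in the query text."""
    q = query.lower()
    return [text for key, text in docs.items() if key in q]


def build_prompt(transcript: str, docs: dict[str, str]) -> str:
    """Prepend retrieved context to the transcript, if any was found."""
    context = retrieve(transcript, docs)
    if not context:
        return transcript
    return "Context:\n" + "\n".join(context) + "\n\nUser: " + transcript


def handle_turn(transcript: str, docs: dict[str, str],
                llm: Callable[[str], str]) -> str:
    """One turn: the transcript is plain text, so it can be inspected and edited here."""
    return llm(build_prompt(transcript, docs))


docs = {"refund": "Refunds take 5 days."}
prompt = build_prompt("How do I get a refund?", docs)
```

With a pure audio-in, audio-out model there is no equivalent string to rewrite at this point, which is the control problem described above.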

gunalx 15 days ago

Yep, I think this is something end-to-end models need to solve to be ideal. I have seen a split-brain architecture with one speaking brain and one thinking brain. The thinking one could take text tokens as input and output, so it can refine its reasoning and use RAG and tools, while the audio brain does the audio decoding in parallel.

dsrtslnd23 16 days ago

Yes, I am currently playing with pipecat, both with an ASR + LLM + TTS pipeline and with speech-to-text (Ultravox) + TTS, but I haven't been successful with local speech-to-speech setups yet.