Comment by storystarling
7 hours ago
I suspect the glued pipeline is going to remain dominant for a while, mostly because the intermediate text layer is structural, not just a byproduct. If you drop the text for a pure E2E model, you suddenly lose the ability to easily inject RAG context or handle complex tool use. I've been building some agent workflows recently and having that text state to pass into something like LangGraph is the only way to reliably control the logic. Without it, you are basically flying blind on the backend.
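To make the point about the text layer concrete, here is a minimal sketch of one turn in a glued pipeline. Everything in it is a hypothetical stand-in: `transcribe` fakes the ASR step by decoding bytes, `retrieve` is a toy keyword lookup rather than a real RAG retriever, and `TurnState`, `build_prompt`, and `run_turn` are illustrative names, not LangGraph APIs. The only thing it is meant to show is that the transcript is an inspectable piece of state you can attach context and tool results to before the language model sees it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TurnState:
    """Text state for one conversational turn (hypothetical structure)."""
    transcript: str
    context: list[str] = field(default_factory=list)
    tool_results: dict[str, str] = field(default_factory=dict)


def transcribe(audio: bytes) -> str:
    # Stand-in for a real ASR model: assumes the "audio" is UTF-8 text.
    return audio.decode("utf-8")


def retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    # Toy retrieval: return documents whose key appears as a word in the query.
    words = set(query.lower().split())
    return [doc for key, doc in corpus.items() if key in words]


def build_prompt(state: TurnState) -> str:
    # Assemble the text that would be handed to the language model.
    parts = []
    if state.context:
        parts.append("Context:\n" + "\n".join(state.context))
    for name, result in state.tool_results.items():
        parts.append(f"Tool {name} returned: {result}")
    parts.append("User: " + state.transcript)
    return "\n\n".join(parts)


def run_turn(
    audio: bytes,
    corpus: dict[str, str],
    tools: dict[str, Callable[[str], str]],
) -> str:
    # 1. Speech to text: this is the intermediate layer a pure E2E model drops.
    state = TurnState(transcript=transcribe(audio))
    # 2. Inject retrieved context keyed off the transcript.
    state.context = retrieve(state.transcript, corpus)
    # 3. Naive tool routing: call a tool if its name is mentioned.
    for name, tool in tools.items():
        if name in state.transcript.lower():
            state.tool_results[name] = tool(state.transcript)
    # 4. The prompt would go to the LLM, then to TTS (both omitted here).
    return build_prompt(state)
```

In a real system steps 2 and 3 would be nodes in whatever orchestration layer you use, but the shape is the same: each step reads and writes plain text state that you can log, test, and branch on.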
Yep, I think this is something end-to-end models need to solve to be ideal. I have seen a split-brain architecture with one speaking brain and one thinking brain. If the thinking brain could take some text tokens as input and output, it could refine its reasoning and use RAG and tools while the audio brain handles audio decoding in parallel.