Comment by gunalx

7 hours ago

Yep, this is something end-to-end models need to solve to be ideal, I think. I have seen a split-brain architecture with one speaking brain and one thinking brain. It would help if the thinking one could take text tokens as both output and input, so it could refine its reasoning and use RAG and tools, with the audio brain doing the audio decoding in parallel.
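
To make the idea concrete, here is a toy sketch of that split, not how any real model does it. Everything in it is a hypothetical stand-in: `thinking_brain` just emits labelled text notes where a real system would reason and call RAG or tools, and `speaking_brain` wraps each note in a fake audio tag where a real system would decode audio. The only point is that the two run concurrently and communicate through text.

```python
import asyncio


async def thinking_brain(prompt, notes):
    # Hypothetical stand-in: a real thinking model would refine its
    # reasoning here and call RAG or tools, emitting text tokens as it goes.
    for step in ("retrieve", "reason", "answer"):
        await asyncio.sleep(0)  # yield control so the speaking brain can run
        await notes.put(f"{step}:{prompt}")
    await notes.put(None)  # sentinel: no more text tokens


async def speaking_brain(notes, audio_out):
    # Hypothetical stand-in for audio decoding: consumes text notes as they
    # arrive instead of waiting for the thinking brain to finish.
    while True:
        note = await notes.get()
        if note is None:
            break
        audio_out.append(f"<audio {note}>")


async def respond(prompt):
    notes = asyncio.Queue()  # text channel between the two brains
    audio_out = []
    await asyncio.gather(
        thinking_brain(prompt, notes),
        speaking_brain(notes, audio_out),
    )
    return audio_out


def run(prompt):
    return asyncio.run(respond(prompt))
```

Calling `run("hi")` returns the list of fake audio chunks in the order the thinking side produced them. The queue is the text interface between the brains, which is the part I think matters.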