Comment by nicktikhonov
18 hours ago
100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
Years ago I wrote a system that would generate Lucene queries on the fly and return results. The ~250 ms response time was deemed too long, so I added some information about where the response data originated, and started returning "According to..." within 50 ms of the end of user input. So the actual information got to the user after a longer delay, but it felt almost as fast as conversation.
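A minimal sketch of that latency-masking idea, assuming an async pipeline where the filler audio is already cached and the real answer comes from a slow backend (all names and timings here are hypothetical, not from either system described above):

```python
import asyncio
import time

# Hypothetical cache of pre-synthesized filler clips; in a real system
# these would be TTS audio buffers generated ahead of time.
FILLER_CACHE = {
    "thinking": b"<audio: 'Hmm, let me think about that...'>",
}

async def slow_answer(question: str) -> str:
    # Stand-in for the full LLM + TTS pipeline (simulated ~200 ms delay).
    await asyncio.sleep(0.2)
    return f"Answer to: {question}"

async def respond(question: str) -> list:
    """Emit a cached filler within milliseconds, then the real answer."""
    events = []
    start = time.monotonic()
    # Kick off the slow pipeline immediately...
    answer_task = asyncio.create_task(slow_answer(question))
    # ...and serve the cached filler right away to mask the latency.
    events.append(("filler", FILLER_CACHE["thinking"], time.monotonic() - start))
    events.append(("answer", await answer_task, time.monotonic() - start))
    return events

events = asyncio.run(respond("What is perceived latency?"))
```

The filler event lands almost immediately while the real answer arrives after the backend delay, which is the whole trick: the user hears something right away, so the total wait feels much shorter.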
See also any public speaker who starts every answer to a question from the audience (or in a verbal interview) with something like "that is a good question!" or "thank you for asking me that!"
Same strategy but employed by humans.
"You are absolutely right!"