Comment by brody_hamer
16 hours ago
> Voice is a turn-taking problem
It really feels to me like there’s some low-hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the LLM notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go a long way toward making the back-and-forth feel more like a conversation, and if the speaker wasn’t done speaking, there’s none of that talking-over-the-user garbage. (Say the filler word, then continue listening.)
100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
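A minimal sketch of that caching idea, assuming a pre-warmed TTS cache (the pool, keys, and `synthesize` hook are all hypothetical, not any particular API):

```python
import random

# Hypothetical pool of filler phrases, keyed by a coarse turn type
# so the filler is at least loosely "contextually aware".
FILLER_POOL = {
    "question": ["Hmm, let me think about that...", "Good question..."],
    "statement": ["Right, right.", "Mhmm.", "I see."],
}

# Pre-synthesized audio, filled once at startup so serving a filler
# needs no model call and no TTS call at request time.
tts_cache = {}

def warm_cache(synthesize):
    """Synthesize every filler phrase once; synthesize() is any TTS function."""
    for phrases in FILLER_POOL.values():
        for phrase in phrases:
            tts_cache[phrase] = synthesize(phrase)

def filler_audio(turn_kind):
    """Pick a random filler for this turn type and return its cached audio."""
    phrase = random.choice(FILLER_POOL.get(turn_kind, FILLER_POOL["statement"]))
    return phrase, tts_cache[phrase]
```

The point of the design is that the only work on the hot path is a dict lookup, so the filler can start playing within milliseconds while the real model is still generating.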
Years ago I wrote a system that would generate Lucene queries on the fly and return results. The ~250 ms response time was deemed too long, so I added some information about where the response data originated, and started returning "According to..." within 50 ms of the end of user input. So the actual information got to the user after a longer delay, but it felt almost as fast as conversation.
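That latency-masking trick could be sketched like this (a toy asyncio version; the `run_search` and `speak` hooks and the source name are assumptions, not the original system):

```python
import asyncio

# Hypothetical sketch: speak a cheap preamble ("According to <source>...")
# immediately while the slow query runs, so the perceived latency is the
# preamble's, not the query's.
async def answer(query, run_search, speak):
    source = "the product index"          # assumed to be known before the query runs
    preamble = asyncio.create_task(speak(f"According to {source}, "))
    results = await run_search(query)     # the slow ~250 ms part runs concurrently
    await preamble                        # make sure we don't talk over ourselves
    await speak(results)
```

The user hears audio almost immediately; the real answer lands at the end of the preamble instead of after a silent gap.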
See also any public speaker who starts every answer to a question from the audience (or in a verbal interview) with something like "that is a good question!" or "thank you for asking me that!"
Same strategy but employed by humans.
"You are absolutely right!"
The filler word idea is interesting, but I suspect the uncanny valley risk is super high. A mistimed "mhm" from a computer would probably feel way worse than just silence, because now your brain is pattern-matching against human conversation and every small timing error stands out more.
I am not sure about the low-hanging fruit. It's not easy to make something robotic more human. Based on personal experience, I thought it would be low-hanging fruit for text: take a simple LLM answer to anything and replace the "-" and the "it's not x, it's y" thing that people almost always associate with LLMs with something else. Guess what? Now those answers sound even MORE robotic. Obviously this was a pet project that I cooked up in less than an hour, but the more I tried to make it human, the more it became AI.
Recently: https://blog.livekit.io/prompting-voice-agents-to-sound-more...
Better if it can anticipate its response before you're done speaking. That would be subject to change depending on what the speaker says, but it might be able to start immediately.
It's bad enough dealing with people who don't think before they speak; now we gotta make the computers do it as well‽
Huh, the grandfather comment was suggesting to have the computer think while you speak.
That's different from banning the computer from thinking before it speaks, ain't it?
1) If the system misdetected end-of-turn and realizes its error only after it has started speaking, it could abort the commitment to interrupt the speaker by turning the misfire into background filler: collect, say, 90% of English syllables ahead of time and, for each, find a filler word that starts with that syllable.
2) If end-of-turn was detected very late, we could randomly select a first phonetic syllable, start TTS on it immediately, and add an instruction to the prompt that the reply should start with that syllable!
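Idea 1) could be sketched roughly like this (the onset table is a toy stand-in for a real syllable inventory, and the matching is deliberately simplified):

```python
# Hypothetical sketch: if we started speaking on a false end-of-turn,
# pick a filler word that begins with the syllable already uttered,
# so the misfire blends into background filler instead of an interruption.
FILLERS_BY_ONSET = {
    "so": "so, anyway...",
    "well": "well...",
    "rai": "right, right.",
    "hm": "hmm.",
    "uh": "uh-huh.",
}

def recover_as_filler(uttered_syllable, fallback="mhmm."):
    """Map the syllable we already spoke to a filler that starts with it."""
    key = uttered_syllable.lower()
    if key in FILLERS_BY_ONSET:
        return FILLERS_BY_ONSET[key]
    # Otherwise take the longest onset that the uttered syllable extends.
    for onset, filler in sorted(FILLERS_BY_ONSET.items(), key=lambda kv: -len(kv[0])):
        if key.startswith(onset):
            return filler
    return fallback
```

In a real system the table would be built from a syllable inventory plus pre-synthesized TTS clips, so the recovery path adds no synthesis latency.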