Comment by brody_hamer
16 hours ago
> Voice is a turn-taking problem
It really feels to me like there’s some low-hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the LLM notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go a long way toward making the back-and-forth feel more like a conversation, and if the speaker wasn’t done speaking, there’s none of that talking-over-the-user garbage. (Say the filler word, then continue listening.)
100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
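A minimal sketch of that caching idea, assuming a pre-warmed TTS cache (the pool, keys, and `synthesize` hook are all hypothetical, not any particular API):

```python
import random

# Hypothetical pool of filler phrases, keyed by a coarse turn type
# so the filler is at least loosely "contextually aware".
FILLER_POOL = {
    "question": ["Hmm, let me think about that...", "Good question..."],
    "statement": ["Right, right.", "Mhmm.", "I see."],
}

# Pre-synthesized audio, filled once at startup so serving a filler
# needs no model call and no TTS call at request time.
tts_cache = {}

def warm_cache(synthesize):
    """Synthesize every filler phrase once; synthesize() is any TTS function."""
    for phrases in FILLER_POOL.values():
        for phrase in phrases:
            tts_cache[phrase] = synthesize(phrase)

def filler_audio(turn_kind):
    """Pick a random filler for this turn type and return its cached audio."""
    phrase = random.choice(FILLER_POOL.get(turn_kind, FILLER_POOL["statement"]))
    return phrase, tts_cache[phrase]
```

The point of the design is that the only work on the hot path is a dict lookup, so the filler can start playing within milliseconds while the real model is still generating.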
Years ago I wrote a system that would generate Lucene queries on the fly and return results. The ~250 ms response time was deemed too long, so I added some information about where the response data originated, and started returning "According to..." within 50 ms of the end of user input. So the actual information got to the user after a longer delay, but it felt almost as fast as conversation.
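That latency-masking trick could be sketched like this (a toy asyncio version; the `run_search` and `speak` hooks and the source name are assumptions, not the original system):

```python
import asyncio

# Hypothetical sketch: speak a cheap preamble ("According to <source>...")
# immediately while the slow query runs, so the perceived latency is the
# preamble's, not the query's.
async def answer(query, run_search, speak):
    source = "the product index"          # assumed to be known before the query runs
    preamble = asyncio.create_task(speak(f"According to {source}, "))
    results = await run_search(query)     # the slow ~250 ms part runs concurrently
    await preamble                        # make sure we don't talk over ourselves
    await speak(results)
```

The user hears audio almost immediately; the real answer lands at the end of the preamble instead of after a silent gap.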
See also any public speaker who starts every answer to a question from the audience (or in a verbal interview) with something like "that is a good question!" or "thank you for asking me that!"
Same strategy but employed by humans.
"You are absolutely right!"
The filler word idea is interesting, but I suspect the uncanny valley risk is super high. A mistimed "mhm" from a computer would probably feel way worse than just silence, because now your brain is pattern-matching against human conversation and every small timing error stands out more.
I am not sure about the low-hanging fruit. It's not easy to make something robotic more human. Based on personal experience, I thought it would be low-hanging fruit for text: take a simple LLM answer to anything and replace the "-" and the "it's not x, it's y" thing that people almost always associate with LLMs with something else. Guess what? Now those answers sound even MORE robotic. Obviously this was a pet project that I cooked up in less than an hour, but the more I tried to make it human, the more it became AI.
Recently: https://blog.livekit.io/prompting-voice-agents-to-sound-more...
Better if it can anticipate its response before you're done speaking. That would be subject to change depending on what the speaker says, but it might be able to start immediately.
It's bad enough dealing with people who don't think before they speak; now we gotta make the computers do it as well‽
Huh, the grandfather comment was suggesting to have the computer think while you speak.
That's different from banning the computer from thinking before it speaks, ain't it?
1) If the system misdetected end-of-turn and realizes its error only after it has started speaking, it could abort the commitment to interrupt the speaker by turning the misfire into background filler: collect, say, 90% of English syllables ahead of time and, for each, find a filler word that starts with that syllable.
2) If end-of-turn was detected very late, we could randomly select a first phonetic syllable, start TTS on it immediately, and add an instruction to the prompt that the reply should start with that syllable!
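Idea 1) could be sketched roughly like this (the onset table is a toy stand-in for a real syllable inventory, and the matching is deliberately simplified):

```python
# Hypothetical sketch: if we started speaking on a false end-of-turn,
# pick a filler word that begins with the syllable already uttered,
# so the misfire blends into background filler instead of an interruption.
FILLERS_BY_ONSET = {
    "so": "so, anyway...",
    "well": "well...",
    "rai": "right, right.",
    "hm": "hmm.",
    "uh": "uh-huh.",
}

def recover_as_filler(uttered_syllable, fallback="mhmm."):
    """Map the syllable we already spoke to a filler that starts with it."""
    key = uttered_syllable.lower()
    if key in FILLERS_BY_ONSET:
        return FILLERS_BY_ONSET[key]
    # Otherwise take the longest onset that the uttered syllable extends.
    for onset, filler in sorted(FILLERS_BY_ONSET.items(), key=lambda kv: -len(kv[0])):
        if key.startswith(onset):
            return filler
    return fallback
```

In a real system the table would be built from a syllable inventory plus pre-synthesized TTS clips, so the recovery path adds no synthesis latency.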