Comment by drakenot
2 days ago
Something that really kills the 'effect' of most of the Voice > AI demos that I see is the cold start / latency.
The OpenAI "Voice Mode" is closer, but when we can have near instantaneous and natural back and forth voice mode, that will be a big in terms of it feeling magical. Today, it is say something, awkwardly wait N seconds then listen to the reply and sometimes awkwardly interrupt it.
Even if the models were no smarter than they are today, if we could crack that "conversational" piece and performance piece, it would be a big difference in my opinion.
Yeah the way I am handling this is turn detection which feels unnatural. I like how Livekit handles turn detection with a small model[0][1] [0]https://www.youtube.com/watch?v=EYDrSSEP0h0 [1]https://docs.livekit.io/agents/build/turns/turn-detector/
``` turn_detection: { type: "server_vad", threshold: 0.4, prefix_padding_ms: 400, silence_duration_ms: 1000, }, ```
I think it will always feel unnatural as long as 'AI Speech' is turn based. Right now developers used Voice Activity Detection to detect when the user has stopped talking.
What would be REALLY cool is if we had something that would interrupt you during conversation like talking with a real human.
I can see how interruptions would prove even more unnatural and annoying pretty quick. There's a lot of nuance in knowing how to interrupt properly and often, people that interrupt only do so quickly, then yield, allow person to finish then resume - very situational and tons of nuance. Otherwise, with current level of sophistication, you'd just have the AI talking over you the entire time, not allowing you to complete your thoughts/questions/commands/etc and people would quickly be more frustrated and just turn it off.
I absolutely agree with your analysis wrt current tech - however, I suspect the person you're replying to is talking about "what would be really cool" in terms of it happening in a future where the relevant underpinnings had advanced to the point where it could actually manage the situational/nuance stuff properly.
I almost certainly wouldn't want to use something that tried to implement it now but it's a lovely dream and the state of the art keeps advancing at quite the speed (i.e. faster than I would have predicted, even when I do my best to take into account that it keeps advancing faster than I would have predicted ;).
Have you recently tried OpenAI voice mode from ChatGPT Plus? It's basically what you describe
Yes, I mentioned this in the comment.
I think it is closer, although still even it has a cold start problem. Once you are connected and in-session, it is a better experience.
There is still some "turn based" conversational aspect to it that can be awkward but it is much better. It also helps that you can "tap and hold" to override, which is a bit of a hack but works well in practice for that mobile use-case.
Sorry what I meant was: have you tried _recently_? I feel that what you describe is the old implementation. The current version, with a colorful bubble, has very low latency and starts almost instantly. The older version has a UI composed of a white dot on a black background, and is pretty slow. But the more recent one is pure magic IMHO.