Comment by rfv6723

10 hours ago

Human spoken conversation doesn’t really work like file buffering.

People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.

For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.
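
In code, the policy that implies looks something like this; a minimal sketch, assuming a hypothetical jitter-buffer API (`pop`, `render`, and `conceal` are made-up names, not anything from WebRTC):

```python
FRAME_MS = 20    # typical audio frame duration (illustrative)
WAIT_MS = 20     # how long we'll wait past a frame's slot before giving up

def playout_loop(jitter_buffer, render, conceal):
    """Prefer a small omission over a stall: if a frame misses its
    playout deadline, emit a concealment frame and keep going."""
    seq = 0
    while True:
        frame = jitter_buffer.pop(seq, timeout_ms=WAIT_MS)  # hypothetical API
        if frame is not None:
            render(frame)       # arrived in time: play it
        else:
            conceal(FRAME_MS)   # late or lost: fill the gap, never freeze
        seq += 1
```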

> People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

LLMs are surprisingly good at this, too.

This entire blog post is based on two assumptions:

1) WebRTC garbling is common

2) LLMs fall apart if there are any audio glitches

I would bet money that OpenAI has explored both of those and has statistics on how they impact the service. That's more than this blogger has done, heaping snark upon snark to avoid having a realistic conversation about the pros and cons.

I think this is mixing domains quite a bit.

If I'm talking to a friend or peer and I'm on a crappy link, we can probably work it out. If I'm calling my lawyer from prison with my "one call", I really want my lawyer to get my instructions clearly and correctly, ideally the first time and without a lot of coaching.

Where on this scale does "person talking to LLM" fit?

I believe there's a ton of research into the Shannon limit and human speech. You can trivially observe how much redundancy there is by listening to a podcast at 1x, 1.2x, 1.5x, 2x, etc.; at the speed where you can no longer follow what's going on, you've found the limit of the redundancy built into that language. That number falls way off when you're listening to a person with an accent or when the recording is noisy or whatever.
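
A back-of-envelope version of that, with every number an illustrative assumption rather than a measurement:

```python
# Rough information rate of conversational speech (all figures illustrative).
syllables_per_sec = 5        # ballpark for conversational English
bits_per_syllable = 7        # order-of-magnitude entropy estimate
info_rate = syllables_per_sec * bits_per_syllable   # ~35 bits/s

# If you can still follow a podcast at 2x, roughly half the playback
# time was redundancy you didn't strictly need:
max_speedup = 2.0
redundancy = 1 - 1 / max_speedup                    # ~50%
print(f"{info_rate} bits/s, ~{redundancy:.0%} redundancy")
```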

You'll also find that your tolerance for lossy media is radically different depending on the latency, echoes, and jitter in the audio (which I believe is the point of the original "don't use WebRTC" article...).
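
For what it's worth, that's also why jitter buffers adapt: the standard move is to estimate inter-arrival jitter and size the playout delay off it. A sketch of the RFC 3550-style estimator (the /16 smoothing is from the RFC; the multiplier `k` is an illustrative choice):

```python
class JitterEstimator:
    """Smoothed inter-arrival jitter, RFC 3550 style. Playout delay is
    then set as a multiple of it: more jitter -> bigger buffer ->
    more latency but fewer stalls."""

    def __init__(self):
        self.jitter = 0.0
        self.prev_transit = None

    def update(self, send_ts_ms: float, recv_ts_ms: float) -> float:
        transit = recv_ts_ms - send_ts_ms
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16   # RFC 3550 smoothing
        self.prev_transit = transit
        return self.jitter

    def target_playout_delay_ms(self, k: float = 4.0) -> float:
        return k * self.jitter   # k is an illustrative safety margin
```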

Finally, people may tolerate this, but the "phoneme to token" thinger may be less tolerant, and it certainly won't be able to magic the correct meaning out of lost packets. And if the resulting exchange is extremely expensive or important (as in the lawyer and the "I'm in jail in Poughkeepsie; I need bail!" exchange), you really want to take the time to get it right, not make things guess.