
Comment by pocksuppet

9 hours ago

Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.
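The best-case/worst-case behavior described above can be sketched with a small simulation. This is an illustrative model, not real networking code: chunks are produced every 20 ms (a typical Opus frame interval), a lost chunk costs one RTT per retransmission, and in-order delivery means a late chunk blocks everything behind it (head-of-line blocking). All constants are assumptions for illustration.

```python
import random

def simulate_tcp_stream(num_packets, loss_rate, rtt_ms=100, seed=0):
    """Model per-chunk delivery delay for audio sent over a TCP-like
    reliable, in-order stream. Returns a list of delays in ms."""
    rng = random.Random(seed)
    frame_ms = 20            # one audio frame produced every 20 ms
    one_way = rtt_ms / 2
    delivered_at = 0.0
    delays = []
    for i in range(num_packets):
        produced_at = i * frame_ms
        # count transmissions until one copy survives the lossy link
        tries = 1
        while rng.random() < loss_rate:
            tries += 1
        arrival = produced_at + one_way + (tries - 1) * rtt_ms
        # in-order delivery: a retransmitted chunk delays all later ones
        delivered_at = max(delivered_at, arrival)
        delays.append(delivered_at - produced_at)
    return delays
```

With zero loss every chunk arrives after one one-way trip (the best case); with loss, delays spike around each retransmission and then drain back down (the user waits, but nothing is lost).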

You ultimately still need a jitter buffer large enough to absorb retransmissions; otherwise you've got stuttering audio. And dynamically adjusting this jitter buffer is hard.
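To make the "dynamically adjusting" part concrete, here's a minimal sketch of an adaptive playout delay. The jitter estimator loosely follows the RFC 3550 interarrival-jitter EWMA (gain 1/16); the frame size and safety factor are illustrative assumptions, and real implementations (e.g. NetEQ in WebRTC) are far more elaborate.

```python
class AdaptiveJitterBuffer:
    """Track a smoothed jitter estimate and derive a target playout
    delay from it. A sketch, not a production buffer."""

    def __init__(self, frame_ms=20, safety_factor=4.0):
        self.frame_ms = frame_ms
        self.safety_factor = safety_factor  # headroom multiplier (assumed)
        self.jitter = 0.0                   # smoothed jitter estimate, ms
        self.prev_transit = None

    def on_packet(self, send_ts_ms, recv_ts_ms):
        transit = recv_ts_ms - send_ts_ms
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            # EWMA with gain 1/16, as in the RFC 3550 jitter calculation
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit

    def target_delay_ms(self):
        # hold at least one frame, plus headroom proportional to jitter
        return max(self.frame_ms, self.safety_factor * self.jitter)
```

The hard part the comment alludes to is exactly the tuning here: grow the buffer too eagerly and you add latency for everyone; shrink it too eagerly and one retransmission burst stutters the audio.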

  • > And dynamically adjusting this jitter buffer is hard

    Unappreciated part of this entire conversation.

  • I'm not an expert. Can't we exploit the fact that LLMs don't need to receive audio as a continuous, uninterrupted stream? Couldn't we just send the data and pipe it into the LLM, deduplicating any chunks that were resent?

      x...y...y[dedup]...z

    • You’re absolutely correct. A jitter buffer is necessary for a human listener, but an LLM isn’t aware of a time lapse, just as it isn’t aware of the time since your last message in the conversation (unless the chat harness explicitly informs it).
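The dedup idea sketched above (`x...y...y[dedup]...z`) is easy to express if each chunk carries a sequence number. This is a hypothetical sketch of the receiving side, not any real protocol: duplicates from retransmission are dropped and chunks are reordered by sequence number, since the model only cares about the reassembled stream, not arrival timing.

```python
def dedup_chunks(chunks):
    """Drop retransmitted audio chunks and restore order before feeding
    a speech model. `chunks` is an iterable of (seq, payload) pairs."""
    seen = set()
    kept = []
    for seq, payload in chunks:
        if seq in seen:
            continue  # retransmitted copy: we already have this chunk
        seen.add(seq)
        kept.append((seq, payload))
    # reorder by sequence number so the model sees contiguous audio
    kept.sort(key=lambda pair: pair[0])
    return [payload for _, payload in kept]
```

For example, the arrival pattern from the comment, `x, y, y, z`, collapses back to `x, y, z` before it ever reaches the model.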