
Comment by amluto

1 year ago

Maybe silly question:

> jitter buffer [40ms]

Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.

Almost any gap in audio is audible and sounds really bad. 40ms is a lot, but playing out 40ms of silence whenever a frame is late is probably worse.

  • Sounds bad to whom? I’m talking about the direction from user to AI, not the direction from AI to user. If some of the audio gets delayed on the way to the AI, the AI can be paused. If some of the audio gets delayed on the way to a human, the human can’t be paused, so some buffering is needed to reduce the risk of gaps.
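
  A minimal sketch of that "pause the AI" idea, assuming a hypothetical `SttSession.feed()` incremental-recognition API and sequence-numbered frames (neither is from any real library). There's no jitter buffer: frames are fed to the model the moment a contiguous run is available, and a missing frame just blocks the loop instead of being replaced with silence:

  ```python
  import queue

  class SttSession:
      """Hypothetical incremental speech-to-text interface (an
      assumption for this sketch): feed() consumes PCM frames in order."""
      def feed(self, pcm_frame: bytes) -> None:
          ...  # run incremental recognition on this frame

  def ingest(frames: "queue.Queue[tuple[int, bytes]]", stt: SttSession) -> None:
      """Feed audio to the model as it arrives, with no jitter buffer.

      Each frame carries a sequence number. If frame n+1 is late, we
      simply block until it shows up -- the model is effectively
      'paused', and no silence is synthesized to paper over the gap.
      """
      pending: dict[int, bytes] = {}   # out-of-order frames parked by seq
      next_seq = 0
      while True:
          seq, frame = frames.get()    # blocks while the network is quiet
          pending[seq] = frame
          while next_seq in pending:   # drain any contiguous run of frames
              stt.feed(pending.pop(next_seq))
              next_seq += 1
  ```

  The tradeoff is latency variance instead of audible gaps, which a model tolerates fine but a human listener on live playout would not.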