Comment by watchlight

13 hours ago

Transcription is part of my own pipeline, and I was a bit surprised to learn how many short words were being cut out because the VAD (voice activity detection) was filtering too aggressively. It wasn't that the models were bad; the activity detection just wasn't tuned correctly for background noise. The starts and ends of words would also occasionally get truncated. "Um" and "uh" are often muttered, and they're short enough that the blips aren't considered "real enough speech" to get flagged.
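
To make that failure mode concrete, here's a toy sketch. It is not how any real VAD is implemented; the per-frame probabilities, the `speech_segments` helper, and the `min_frames` rule are all made up for illustration. The point is just that a minimum-duration rule silently drops a short "um" while keeping the longer word next to it.

```python
def speech_segments(probs, threshold=0.5, min_frames=3):
    """Return (start, end) frame ranges where the speech probability stays
    at or above `threshold` for at least `min_frames` consecutive frames."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # a candidate speech run begins here
        elif start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None  # runs shorter than min_frames are discarded
    # Close a run that lasts until the end of the audio.
    if start is not None and len(probs) - start >= min_frames:
        segments.append((start, len(probs)))
    return segments


# Hypothetical frame probabilities: a 2-frame "um" at frames 1-2,
# then a longer word at frames 5-8.
probs = [0.1, 0.7, 0.8, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.2]

strict = speech_segments(probs, min_frames=3)  # the "um" is dropped
loose = speech_segments(probs, min_frames=1)   # the "um" survives
print(strict, loose)
```

With the stricter setting only the long word is kept; relaxing the minimum duration brings the filler back, at the cost of letting more noise blips through.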

Faster-whisper does allow for VAD tuning via API. I don't think it's exposed natively for Whisper though.
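
For reference, a minimal sketch of what that tuning can look like in faster-whisper, assuming its `transcribe(..., vad_filter=True, vad_parameters=...)` interface. The numeric values are illustrative guesses rather than tuned recommendations, the `transcribe_with_tuned_vad` helper and the model size are my own placeholders, and the accepted keys can differ between versions, so check them against the release you have installed.

```python
# Illustrative VAD settings meant to keep short fillers like "um" and "uh".
# These numbers are assumptions for the sketch, not measured optima.
VAD_PARAMETERS = {
    "threshold": 0.3,                # accept lower speech probabilities
    "min_speech_duration_ms": 50,    # keep very short bursts of speech
    "min_silence_duration_ms": 500,  # silence needed before a segment is split
    "speech_pad_ms": 400,            # padding so word edges are not clipped
}


def transcribe_with_tuned_vad(audio_path, model_size="small"):
    """Transcribe `audio_path` with the VAD filter on and the settings above."""
    # Imported here so the settings can be inspected without the library installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,
        vad_parameters=VAD_PARAMETERS,
    )
    # `segments` is lazy; iterating over it is what runs the transcription.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Lowering the threshold and minimum speech duration trades fewer dropped fillers for more false positives on background noise, so it's worth checking against a noisy sample from your own pipeline.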