Comment by vunderba

3 days ago

Pairing speech recognition with an LLM acting as a post-processor is a pretty good approach.

I put together a script a while back which takes any audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov's whisper.cpp for transcription, and then forwards the transcript to an LLM to clean up the text. I've used it with a pretty high success rate on some of my very old and poorly recorded voice dictation recordings from over a decade ago.

Public gist in case anyone finds it useful:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
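For anyone who'd rather skim the shape of it than open the gist: here's a minimal sketch of the pipeline as described above. The ffmpeg loudnorm filter, the ./whisper-cli binary path, the model file, and the llm_cleanup stub are all my assumptions, not necessarily what the gist actually does.

```python
#!/usr/bin/env python3
"""Normalize -> whisper.cpp -> LLM cleanup, sketched from the description above."""
import pathlib
import subprocess
import sys

def normalize(src: str) -> str:
    """Convert to 16 kHz 16-bit mono WAV and even out levels with loudnorm."""
    dst = str(pathlib.Path(src).with_suffix(".norm.wav"))
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", "loudnorm", "-ar", "16000", "-ac", "1", dst],
        check=True,
    )
    return dst

def transcribe(wav: str) -> str:
    """Run whisper.cpp; -otxt writes the transcript next to the input as <wav>.txt."""
    subprocess.run(
        ["./whisper-cli", "-m", "models/ggml-base.en.bin", "-f", wav, "-otxt"],
        check=True,
    )
    return pathlib.Path(wav + ".txt").read_text()

def llm_cleanup(raw: str) -> str:
    """Stub: send `raw` to your LLM with a 'fix errors, don't add content' prompt."""
    return raw

if __name__ == "__main__":
    print(llm_cleanup(transcribe(normalize(sys.argv[1]))))
```

The conversion step doubles as format normalization, since the classic whisper.cpp CLI expects 16 kHz 16-bit mono WAV input.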

An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (from whisper plus pyannote, for example): SPEAKER_01 says at some point "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
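A minimal sketch of that inference step, assuming a generic chat() callable standing in for whatever LLM you run (the prompt and the JSON contract here are mine, not from any particular library):

```python
import json

def infer_speakers(diarized: str, chat) -> dict:
    """Map SPEAKER_xx labels to names via an LLM. `chat` is a hypothetical
    prompt -> completion callable for whatever model you run."""
    prompt = (
        "Below is a diarized transcript with generic speaker labels. "
        "Infer the speakers' real names from context (introductions, "
        "people addressing each other). Reply with ONLY a JSON object "
        "mapping each label to a name, or to null if it can't be inferred.\n\n"
        + diarized
    )
    return json.loads(chat(prompt))

def relabel(diarized: str, names: dict) -> str:
    """Substitute inferred names back into the transcript."""
    for label, name in names.items():
        if name:
            diarized = diarized.replace(label, name)
    return diarized

# With the example above, infer_speakers should return something like
# {"SPEAKER_01": "Bob", "SPEAKER_02": "Alice"}.
```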

  • Yep, the agent I built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowing when to end transcription in noisy environments like a coffee shop (see the sketch below).
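One common way to do that endpointing is voice activity detection with a silence timer. Here's a minimal sketch using the webrtcvad package; the aggressiveness setting and silence threshold are assumptions to tune, not values from the comment above.

```python
import webrtcvad  # pip install webrtcvad

RATE = 16000          # webrtcvad takes 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30         # frames must be 10, 20, or 30 ms long
END_SILENCE_MS = 900  # trailing silence that ends the utterance; raise it for noise

def find_endpoint(frames):
    """Return the index of the frame where the utterance ends, or None.

    `frames` is an iterable of 30 ms PCM chunks (960 bytes at 16 kHz).
    Aggressiveness 3 is the strictest mode, which helps reject coffee-shop
    background chatter at the cost of sometimes clipping soft speech.
    """
    vad = webrtcvad.Vad(3)
    silence = 0
    for i, frame in enumerate(frames):
        if vad.is_speech(frame, RATE):
            silence = 0
        else:
            silence += FRAME_MS
            if silence >= END_SILENCE_MS:
                return i
    return None
```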

Thanks for sharing. Are some local models better than others? Can small models work well, or do you want 8B+?

  • So in my experience smaller models tend to produce worse results, BUT I actually got really good transcription cleanup with chain-of-thought (CoT) models like Qwen, even quantized down to 8B.
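For reference, here's roughly what such a cleanup call against a local model can look like. This sketch assumes an Ollama server on its default port; the model tag is a placeholder, not a recommendation from the thread.

```python
import json
import urllib.request

def cleanup(transcript: str, model: str = "qwen2.5:7b") -> str:
    """Send a raw transcript to a local Ollama model for cleanup.

    The endpoint is Ollama's default; the model tag is a placeholder --
    swap in whichever local model you've pulled.
    """
    body = json.dumps({
        "model": model,
        "prompt": (
            "Fix the transcription errors and punctuation in the following "
            "text. Do not add, remove, or reorder content.\n\n" + transcript
        ),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```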