Comment by vunderba

3 days ago

Pairing speech recognition with an LLM acting as a post-processor is a pretty good approach.

I put together a script a while back which takes any audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov's whisper.cpp for transcription, and then forwards the transcript to an LLM to clean up the text. I've used it with a pretty high success rate on some of my very old and poorly recorded voice dictation recordings from over a decade ago.

Public gist in case anyone finds it useful:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
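For anyone who'd rather skim the shape of it than open the gist: here's a minimal sketch of the pipeline as described above. The ffmpeg loudnorm filter, the ./whisper-cli binary path, the model file, and the llm_cleanup stub are all my assumptions, not necessarily what the gist actually does.

```python
#!/usr/bin/env python3
"""Normalize -> whisper.cpp -> LLM cleanup, sketched from the description above."""
import pathlib
import subprocess
import sys

def normalize(src: str) -> str:
    """Convert to 16 kHz 16-bit mono WAV and even out levels with loudnorm."""
    dst = str(pathlib.Path(src).with_suffix(".norm.wav"))
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", "loudnorm", "-ar", "16000", "-ac", "1", dst],
        check=True,
    )
    return dst

def transcribe(wav: str) -> str:
    """Run whisper.cpp; -otxt writes the transcript next to the input as <wav>.txt."""
    subprocess.run(
        ["./whisper-cli", "-m", "models/ggml-base.en.bin", "-f", wav, "-otxt"],
        check=True,
    )
    return pathlib.Path(wav + ".txt").read_text()

def llm_cleanup(raw: str) -> str:
    """Stub: send `raw` to your LLM with a 'fix errors, don't add content' prompt."""
    return raw

if __name__ == "__main__":
    print(llm_cleanup(transcribe(normalize(sys.argv[1]))))
```

The conversion step doubles as format normalization, since the classic whisper.cpp CLI expects 16 kHz 16-bit mono WAV input.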

An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (from whisper plus pyannote, for example): SPEAKER_01 says at some point "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
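A minimal sketch of that inference step, assuming a generic chat() callable standing in for whatever LLM you run (the prompt and the JSON contract here are mine, not from any particular library):

```python
import json

def infer_speakers(diarized: str, chat) -> dict:
    """Map SPEAKER_xx labels to names via an LLM. `chat` is a hypothetical
    prompt -> completion callable for whatever model you run."""
    prompt = (
        "Below is a diarized transcript with generic speaker labels. "
        "Infer the speakers' real names from context (introductions, "
        "people addressing each other). Reply with ONLY a JSON object "
        "mapping each label to a name, or to null if it can't be inferred.\n\n"
        + diarized
    )
    return json.loads(chat(prompt))

def relabel(diarized: str, names: dict) -> str:
    """Substitute inferred names back into the transcript."""
    for label, name in names.items():
        if name:
            diarized = diarized.replace(label, name)
    return diarized

# With the example above, infer_speakers should return something like
# {"SPEAKER_01": "Bob", "SPEAKER_02": "Alice"}.
```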

  • Yep, the agent I built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowing when to end transcription in noisy environments like a coffee shop (see the sketch below).
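One common way to do that endpointing is voice activity detection with a silence timer. Here's a minimal sketch using the webrtcvad package; the aggressiveness setting and silence threshold are assumptions to tune, not values from the comment above.

```python
import webrtcvad  # pip install webrtcvad

RATE = 16000          # webrtcvad takes 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30         # frames must be 10, 20, or 30 ms long
END_SILENCE_MS = 900  # trailing silence that ends the utterance; raise it for noise

def find_endpoint(frames):
    """Return the index of the frame where the utterance ends, or None.

    `frames` is an iterable of 30 ms PCM chunks (960 bytes at 16 kHz).
    Aggressiveness 3 is the strictest mode, which helps reject coffee-shop
    background chatter at the cost of sometimes clipping soft speech.
    """
    vad = webrtcvad.Vad(3)
    silence = 0
    for i, frame in enumerate(frames):
        if vad.is_speech(frame, RATE):
            silence = 0
        else:
            silence += FRAME_MS
            if silence >= END_SILENCE_MS:
                return i
    return None
```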

Thanks for sharing. Are some local models better than others? Can small models work well, or do you want 8B+?

  • So in my experience smaller models tend to produce worse results, BUT I actually got really good transcription cleanup with chain-of-thought (CoT) models like Qwen, even quantized down to 8B.
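For reference, here's roughly what such a cleanup call against a local model can look like. This sketch assumes an Ollama server on its default port; the model tag is a placeholder, not a recommendation from the thread.

```python
import json
import urllib.request

def cleanup(transcript: str, model: str = "qwen2.5:7b") -> str:
    """Send a raw transcript to a local Ollama model for cleanup.

    The endpoint is Ollama's default; the model tag is a placeholder --
    swap in whichever local model you've pulled.
    """
    body = json.dumps({
        "model": model,
        "prompt": (
            "Fix the transcription errors and punctuation in the following "
            "text. Do not add, remove, or reorder content.\n\n" + transcript
        ),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```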