Comment by nvdnadj92
21 days ago
I'm working on the same kind of project and was planning to write a blog post similar to the author's. In the meantime, I'll share some additional tips and tricks that really made a difference for me.
For preprocessing, I found it best to convert files to 16kHz WAV for optimal processing. I also apply low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD over the entire audio file to find the timestamps where someone is speaking. A side note on this: Silero requires careful tuning, or speech segments get chopped up and clipped. I also use a post-processing step to merge adjacent VAD chunks, which keeps the audio fed to Whisper cohesive.
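To make this concrete, here's a rough Python sketch of the preprocessing step (the filter cutoffs, VAD parameters, and merge threshold below are illustrative placeholders, not my tuned values):

```python
import subprocess
import torch

# 1. Resample to 16 kHz mono WAV and band-limit to the speech range.
#    The cutoff frequencies here are assumptions; tune them per source.
subprocess.run([
    "ffmpeg", "-y", "-i", "input.mp3",
    "-ac", "1", "-ar", "16000",
    "-af", "highpass=f=100,lowpass=f=4000",
    "clean.wav",
], check=True)

# 2. Run Silero VAD over the whole file to find speech timestamps.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SR = 16000
wav = read_audio("clean.wav", sampling_rate=SR)
chunks = get_speech_timestamps(
    wav, model,
    sampling_rate=SR,
    speech_pad_ms=200,            # pad chunk edges so speech isn't clipped
    min_silence_duration_ms=300,  # don't split on every tiny pause
)

# 3. Merge adjacent chunks separated by short gaps, so Whisper sees
#    cohesive utterances instead of chopped-up fragments.
MAX_GAP = int(0.5 * SR)  # merge chunks under 0.5 s apart (assumed threshold)
merged_chunks = []
for c in chunks:
    if merged_chunks and c["start"] - merged_chunks[-1]["end"] < MAX_GAP:
        merged_chunks[-1]["end"] = c["end"]
    else:
        merged_chunks.append(dict(c))
```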
For the Whisper task, I run Whisper on small audio chunks corresponding to the VAD timestamps. Otherwise, it will hallucinate during silences and regurgitate the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. I ran a performance benchmark, and using a model designed for the Apple Neural Engine made a 22x difference.
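A rough sketch of the per-chunk transcription, building on the `merged_chunks` list from the previous sketch (model size and options are illustrative):

```python
import soundfile as sf
import whisper  # pip install openai-whisper; mlx-whisper exposes a similar transcribe() on Macs

model = whisper.load_model("small")
audio, sr = sf.read("clean.wav", dtype="float32")  # 16 kHz mono from the preprocessing step

segments = []
for chunk in merged_chunks:  # {'start': ..., 'end': ...} in samples, from the VAD step
    piece = audio[chunk["start"]:chunk["end"]]
    # Disable prompt carry-over between chunks to cut down on regurgitation.
    result = model.transcribe(piece, condition_on_previous_text=False)
    offset = chunk["start"] / sr
    for seg in result["segments"]:
        segments.append({
            "start": seg["start"] + offset,  # shift times back to the full file
            "end": seg["end"] + offset,
            "text": seg["text"].strip(),
        })
```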
For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks yields better results.
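The exact prompt takes iteration per source, but the general shape of that step, assuming the official OpenAI Python client, is something like:

```python
from openai import OpenAI

client = OpenAI()
srt_text = open("transcript.srt").read()

# Illustrative prompt only; the real one needs tuning per source material.
prompt = (
    "Below is an SRT subtitle file produced by Whisper. Some cues are "
    "hallucinations: repeated phrases, text during silence, or boilerplate "
    "like 'Thanks for watching'. Return the SRT with those cues removed "
    "and the remaining cues renumbered. Do not rewrite genuine speech.\n\n"
    + srt_text
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
cleaned_srt = resp.choices[0].message.content
```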
I added EQ to a task after reading this and got much more accurate and consistent results with Whisper. Thanks for the obvious-in-retrospect tip!
Could you please share the prompt you use in ChatGPT to remove hallucination chunks?
If I understood correctly, VAD gives better results than ffmpeg's silencedetect + silenceremove, right?
I think the latest version of ffmpeg can use Whisper with VAD [1], but I still need to explore how with a simple PoC script.
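A minimal PoC might look like the following (the whisper filter option names are taken from the ffmpeg docs [1], but the ggml model filenames are assumptions; I haven't verified this end to end):

```python
import subprocess

# ffmpeg 8's whisper audio filter, with a Silero VAD model gating transcription.
# Both .bin paths refer to whisper.cpp-style ggml model files.
subprocess.run([
    "ffmpeg", "-i", "input.wav",
    "-af", ("whisper=model=ggml-base.en.bin"
            ":language=en"
            ":format=srt"
            ":destination=out.srt"
            ":vad_model=ggml-silero-v5.1.2.bin"),
    "-f", "null", "-",
], check=True)
```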
I'd love to know more about the post-processing prompt; my guess is that it's an improved version of the `semantic correction` prompt [2], but I may be wrong ¯\_(ツ)_/¯.
[1] https://ffmpeg.org/ffmpeg-filters.html#toc-whisper-1
[2] https://gist.github.com/eevmanu/0de2d449144e9cd40a563170b459...