Comment by genewitch
13 hours ago
I've anecdotally tested translations by ripping a video along with its subtitles and having Whisper subtitle it, and also by asking several AIs to translate the .srt or .vtt file (subtotext, I think, does the subtitle-to-plain-text conversion if you don't want to waste tokens on the timestamps and metadata).
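For what it's worth, that stripping step is only a few lines if you don't want another tool for it. A minimal sketch for .srt files (the filename is illustrative; .vtt has a header and slightly different cue syntax, so it would need extra handling):

```python
def srt_to_text(path):
    """Strip cue numbers and timestamp lines from an .srt file,
    keeping only the dialogue text."""
    text_lines = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blanks, bare cue numbers, and "00:00:01,000 --> 00:00:04,000" lines
            if not line or line.isdigit() or "-->" in line:
                continue
            text_lines.append(line)
    return "\n".join(text_lines)

print(srt_to_text("episode.srt"))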
Whisper large-v3, the largest model I have, is pretty good: its translations come out nearly identical to ChatGPT's or to Google's default speech-to-text. The fun stuff is when you ask LLMs for text-to-text translations.
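For anyone who hasn't tried it, the translation path the comment describes is built into the standard openai-whisper package; a minimal sketch (the audio filename is illustrative):

```python
import whisper

# Load the large-v3 checkpoint (roughly 10 GB of VRAM on GPU; it runs on CPU too, slowly)
model = whisper.load_model("large-v3")

# task="translate" makes Whisper emit English regardless of the source language;
# task="transcribe" would keep the original language instead
result = model.transcribe("episode.mp3", task="translate")
print(result["text"])
```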
I did a really small writeup with an example, but I don't have a place to publish it, nor am I really looking for one.
I used Whisper to transcribe nearly every "episode" of the Loveline syndicated radio show from 1997-2007 or so. It took, IIRC, several days. I use it to grep the audio, as it were. I intend to do the same with my DVDs and such, just so I never have to Google "what movie / TV show is that line from?" I also have a lot of Art Bell shows, and a few others, to transcribe.
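A minimal sketch of that kind of grep-able archive, assuming the same openai-whisper package (the directory name is illustrative):

```python
import pathlib
import whisper

model = whisper.load_model("large-v3")

# Transcribe every mp3 in a directory to a sidecar .txt file,
# so the whole archive can later be searched with plain grep
for mp3 in sorted(pathlib.Path("loveline").glob("*.mp3")):
    out = mp3.with_suffix(".txt")
    if out.exists():
        continue  # resume-friendly: skip episodes already transcribed
    result = model.transcribe(str(mp3))
    out.write_text(result["text"], encoding="utf-8")
```

After that, finding which episode a line came from is an ordinary text search, e.g. `grep -ril "some half-remembered line" loveline/`.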
> I used Whisper to transcribe nearly every "episode" of the Loveline syndicated radio show from 1997-2007 or so.
Yes, I second this. I found Whisper great for that type of scenario as well.
A local monastery had about 200 audio talks (mp3). Whisper converted them all to text, and GPT did a small "smoothing" pass on the output to make it readable. It was about half a million words and only took a few hours.
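The smoothing pass might look something like this sketch, using the current OpenAI Python client; the model name and prompt are assumptions, not from the comment, and half a million words would need to be fed through in chapter-sized chunks:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def smooth(raw_chunk: str) -> str:
    """Ask a chat model to clean up one chunk of a raw speech transcript."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is illustrative
        messages=[
            {"role": "system",
             "content": "Lightly edit this raw speech transcript for readability: "
                        "fix punctuation and paragraph breaks, remove filler words, "
                        "but do not change the speaker's wording or meaning."},
            {"role": "user", "content": raw_chunk},
        ],
    )
    return resp.choices[0].message.content
```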
The monks were delighted: they can distribute their talks in small pamphlets / PDFs now, and it's extra income for the community.
Years ago, as a student, I did some audio transcription manually, and something similar would have taken ages...
I was actually asked by Vermin Supreme to hand-caption some videos, and I instantly regretted besmirching the existing subtitles. I was right that the subtitles were awful, but boy, the thought of hand-transcribing something with Subtitle Edit had me walking that back pretty quick - and this was for a 4-minute video. However, it was lyrics over music, so AI barely gave me a starting transcription.
I wanted this to work with Whisper, but the language I tried it with was Albanian, and the results were absolutely terrible - not even readable English. I'm sure it would be better with Spanish or Japanese.
According to the Common Voice 15 graph in OpenAI's GitHub repository, Albanian is the single worst-performing language you could have picked: https://github.com/openai/whisper
But for what it's worth, I tried putting the YouTube video of Tom Scott presenting at the Royal Institution into the model, and even then the results were only "OK" rather than "good". When even a professional presenter with a professional sound recording in a quiet environment produces errors, the model is not really good enough to bother with.