
Comment by genewitch

8 hours ago

I've anecdotally tested translations by ripping the video with subtitles and having whisper subtitle it, and also by asking several AIs to translate the .srt or .vtt file (subtotext, I think, does this conversion if you don't wanna waste tokens on the metadata).
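For anyone who wants to do that metadata stripping by hand, here is a rough sketch in plain Python. It is not how any particular tool does it; the function name `subtitles_to_text` and the timing regex are my own assumptions, and it only handles the common case (cue numbers, timing lines, the WEBVTT header), not VTT NOTE or STYLE blocks.

```python
import re

# Timing line, e.g. "00:00:01,000 --> 00:00:03,500" (SRT) or "00:01.000 --> 00:04.000" (VTT).
TIMING = re.compile(r"^\d+:\d{2}(:\d{2})?[.,]\d{3}\s+-->\s+")

def subtitles_to_text(raw: str) -> str:
    """Keep only the dialogue lines of an .srt/.vtt file (hypothetical helper)."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks, cue numbers, the WEBVTT header and timing lines.
        # Caveat: a dialogue line that is only digits (e.g. "42") is dropped too.
        if not line or line.isdigit() or line.startswith("WEBVTT") or TIMING.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)

if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:03,500\nHello there.\n"
    print(subtitles_to_text(sample))
```

The result is just the spoken lines, one per row, which is what you would paste into the LLM instead of the full subtitle file.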

Whisper large-v3, the largest model I have, is pretty good, getting nearly identical translations to ChatGPT (or whatever) and Google's default speech-to-text. The fun stuff is when you ask for text-to-text translations from LLMs.

I did a real small writeup with an example but I don't have a place to publish nor am I really looking for one.

I used whisper to transcribe nearly every "episode" of the Love Line syndicated radio show from 1997-2007 or so. It took, iirc, several days. I use it to grep the audio, as it were. I intend to do the same with my DVDs and such, just so I never have to Google "what movie / tv show is that line from?" I also have a lot of Art Bell shows, and a few others to transcribe.
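A minimal sketch of what "grepping the audio" can look like, assuming the transcripts were saved as .srt files in one folder. The names `search_cues` and `grep_transcripts` are made up for illustration, and a phrase that straddles two cues will not match.

```python
import re
from pathlib import Path

def search_cues(srt_text, phrase):
    """Return (start_time, dialogue) for every cue whose text contains phrase."""
    needle = phrase.lower()
    hits = []
    # Cues in an .srt file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # The timing line is the one containing "-->"; dialogue follows it.
        idx = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if idx is None:
            continue
        start = lines[idx].split("-->")[0].strip()
        dialogue = " ".join(l.strip() for l in lines[idx + 1:])
        if needle in dialogue.lower():
            hits.append((start, dialogue))
    return hits

def grep_transcripts(folder, phrase):
    """Search every .srt under folder; yields (file name, start time, dialogue)."""
    for path in sorted(Path(folder).rglob("*.srt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for start, dialogue in search_cues(text, phrase):
            yield path.name, start, dialogue
```

Something like `for hit in grep_transcripts("transcripts", "some quote"): print(hit)` then gives the file and the timestamp to jump to in the recording.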

> I used whisper to transcribe nearly every "episode" of the Love Line syndicated radio show from 1997-2007 or so.

Yes - second this. I found 'Whisper' great for that type of scenario as well.

A local monastery had about 200 audio talks (mp3). Whisper converted them all to text and GPT did a small 'smoothing' of the output to make it readable. It was about half a million words and only took a few hours.

The monks were delighted - they can distribute their talks in small pamphlets / PDFs now, and it is extra income for the community.

Years ago as a student I did some audio transcription manually and something similar would have taken ages...

  • I actually was asked by Vermin Supreme to hand-caption some videos, and I instantly regretted besmirching the existing subtitles. I was correct, the subtitles were awful, but boy, the thought of hand-transcribing something with Subtitle Edit had me walking that back pretty quick. And this was for a 4-minute video. It was lyrics over music, though, so AI barely gave a starting transcription.

I wanted this to work with Whisper, but the language I tried it with was Albanian and the results were absolutely terrible - not even readable English. I'm sure it would be better with Spanish or Japanese.

  • According to the Common Voice 15 graph on OpenAI's GitHub repository, Albanian is the single worst-performing language you could have picked: https://github.com/openai/whisper

    But for what it's worth, I tried putting the YouTube video of Tom Scott presenting at the Royal Institution into the model, and even then the results were only "OK" rather than "good". When even a professional presenter and a professional sound recording in a quiet environment have errors, the model is not really good enough to bother with.