Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.
People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.
HOWTO:
Drop a video or audio file to the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is for cleaning up imperfect transcripts with features like Tools > Fix common errors.
EDIT: Oh, and if you're on the current gen of Nvidia card, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file, output or something like that.
EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):
If you get the errors and the above fixes work, please type your error message in a reply with what worked to help those who come after. Or at least the web crawlers for those searching for help.
uv has a feature to get the correct version of torch based on your available cuda (and some non-cuda) drivers (though I suggest using a venv not the system Python):
This also means you can safely mix torch requirements with non-torch requirements as it will only pull the torch related things from the torch index and everything else from PyPI.
I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with python. That's an incredible feat.
But, when I hear about these kinds of extras, it makes me even more excited. Getting cuda and torch to work together is something I have struggled countless times.
The team at Astral should be nominated for a Nobel Peace Prize.
Aegisub is still actively developed (forked), and imo, both software can't really be compared to one another. They can complement each other, but SE is much better for actual transcription. Aegisub still does the heavy lifting for typesetting and the like.
whisper is definitely nice, but it's a bit too slow.
Having subtitles and transcription for everything is great - but Nemo Parakeet (pretty much whisper by nvidia) completely changed how I interact with the computer.
It enables dictation that actually works and it's as fast as you can think.
I also have a set of scripts which just wait for voice commands and do things.
I can pipe the results to an LLM, run commands, synthesize a voice with F5-TTS back and it's like having a local Jarvis.
Yeah, mind sharing any of the scripts? I looked at the docs briefly, looks like we need to install ALL of nemo to get access to Parakeet? Seems ultra heavy.
I used it like sibling commenter to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better that YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.
I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as good as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic and sometimes now I read an episode instead of listen, or listen to it with mpv with subtitles.
Aside from accessibility as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching on 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay it.
But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining to hear when there's captions for it.
Obviously one important factor to the convenience is how fast your computer is at transcription or translation. I don't use the features in real-time personally currently, although I'd like to if a great UX comes along through other software.
There's also a great podcast app opportunity here I hope someone seizes.
As a hard of hearing person, I can now download any video from the internet (e.g. youtube) and generate subtitles on the fly, not having to struggle to understand badly recorded or unintelligible speech.
I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.
10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to, in order to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.
You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic difference between langauges. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask about why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.
Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.
whisper is great, i wonder why youtube's auto generated subs are still so bad? even the smallest whisper is way better than google's solution? is it licensing issue? harder to deploy at scale?
I believe youtube still uses 40 mel-scale vectors as feature data, whisper uses 80 (which provides finer spectral detail but is computationally more intensive to process naturally, but modern hardware allows for that)
You’d think they’d use the better model for at least videos that have a large view counts (they already do that when deciding compression optimizations).
Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers I built a simple to use and affordable API that you can use: https://lemonfox.ai/
Kdeenlive also supports auto-generating subtitles which need some editing, but it is faster than create them from scratch. Actually I would be happy even with a simple voice detector so that I don't have to set the timings manually.
I ran it last night using docker and it worked extremely well. You need a HuggingFace read-only API token for the Diarization. I found that the web UI ignored the token, but worked fine when I added it to docker compose as an environment variable.
Once local transcription is in more places hopefully we can persuade content creator not to burn bouncing sub-titles into their videos.
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting sub-titles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also local transcription allows for automatic translation and again overlaying subtitles on top of an existing burnt in set is a really poor reading experience.
Also some social media platforms don't offer subtitle functionality, so burned-in is the only way if you want to serve your content to people that require subtitles or refuse to unmute their phones while they watch from their toilet.
I did that (distracting subtitles) on one of my videos and it had a very negative response. I won't do it again, but I was puzzled because I find it much nicer than the traditional subtitle format personally. It's easier for my brain to focus on. (And no one in my test audience minded.)
I recently discovered that the Internet Archive has the Tomodachi fansubs of Fushigi Yugi which, at least in my experience, were the most famous example of that technique.
Algorithm boosts it that’s why they do it. Even if every device had real time 100% accurate subtitling built in they’d still do it if they video performs better with it.
The other other problem with burned-in subtitles is that they normally have horrible formatting. Who wants to try to read single words that only flash on-screen while they are being spoken?
True, but (as someone who not infrequently has to rewind content on just about all streaming apps because it decided one particular subtitle only needed to be display for less than 200ms this time around) sometimes burned-in seems like a good idea.
I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.
not all social media will show subtitles/captions tho, which is the challenge. YouTube Shorts, TikTok videos, IG reels, FB reels, Whatsapp statuses, and more. I think some allow cc but some don't, and if someone reshares to another platform, it may not be there, so some of us burn them in begrudgingly :-)
It's just so annyoing how someone like Netflix offers like 3-4 languages for most of its content when you can basically get it for free via browser extensions (if you watch on browser).
That Netflix who would need to pay more to license more subtitles can't compete with pirated or unlicensed auto-generated subtitles shouldn't really be a surprise.
It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.
Does this have the ability to edit historic words as more info becomes available?
Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to have both low latency and high accuracy, and things like transcription on android do that and you can see the adjusting guesses as you talk.
Fun fact, I just could not work out what this was supposed to be, so I just used Whisper (indirectly, via the FUTO Voice Input app on my phone) and repeated the sentence into it, and it came out with the 'correct' transcription of "How to recognize speech using common sense." first time.
Of course, this is nothing like what I actually said, so... make your own mind up whether that is actually a correct transcription or not!
This is what your brain does when it processes language.
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
A slight segue to this; I was made aware of the phenomena that - The language in which you think in, sets the constraints to which you level of expanse the brain can think and parse information in.
I think in English fortunately and it's an ever evolving language so, expanding as the world does. That is compared to the majority of people where I'm from; English was a second language they had to learn and the people that thought them weren't well equipped with the resources to do a good job.
It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.
Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?
It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomena with their abstractions with their claims of a beauty like music... hmm!
The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.
I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?
queue
The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
I wonder if Apple's upcoming speech APIs can be added too. Would be cool to have it just work out of the box on Macs, without needing to source a model.
Thanks, I was being tripped up by DDOS protection on code.ffmpeg.org for a minute and couldn't read the patch. The combo of Firefox and the fact that Quantum/Lumen/CenturyLink seems to get off by rotating my dynamic IP for no reason occasionally triggers various DDOS protections schemes.
I hope this is the start of more ML filters in ffmpeg. They added the sr (super resolution) filter years ago, but it's old and it's difficult to get the weights so you can run it, since they're not included. They have added support for multiple inference libraries like libtorch, but again, it's difficult to even get started. Hopefully they can get behind a consistent ML strategy, ideally with a "models" directory with ready to use models for upscaling, temporal upscaling, noise cancelling, etc. A lot of audio and video filter research use ML now, new codecs will probably also use it soon.
The reading from mic part (-f pulse, pactl...) is linux-specific rest of it should be cross platform. The `main` executable is the whisper.cpp executable (see whisper.cpp github readme, it's just the output of `make base.en` from that).
Edit: -t 5 controls recording duration.
Oh and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM that then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc) which I then....
The LLM turns my unstructured command into structured command (a limited set of commands hardcoded in the prompt) and a script takes that and executes it. I have it do stuff like interact with google keep/google calendar using the CLI. Those are the most used actions but there's a few others . Of course all actions can be scheduled.
The LLM can screw up now and then and output absolute garbage. But I've got a knack now for figuring out what prompts it's gonna be hopeless on and I manually enter those.
Example:
Saying
Remove makhana from shopping list
Ends up running the command
gkeep items edit shopping_list --check makhana
There is a direct text interface too that skips the voice transcription.
The main thing is it does in a background window without interrupting my screen or me needing to wait for whatever slow webpage to load. I had it do a few things on GitHub like remind me when checks pass on PRs. You could potentially connect it to various things like your amazon account to check on your order, etc,.. as I write this I now realise I did what basically amounts to what folks do with MCP today. Maybe I should update it to use the protocol.
These days I have a little more idle time as a grad student than I did in a tech company, and I don't really need to manage home/cooking/... so I don't really use some of the more complicated features. I mostly just use it to schedule 1on1s with my guide and add reminders about assignments and TA work and talks and my music class.
I know nothing about Whisper, is this usable for automated translation?
I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.
A couple years ago I had been negotiating with a guy on Fiver to translate them. At his usual rate-per-minute of footage it would have cost thousands of dollars but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll need the "large-v3" model for best results, and you can use ffmpeg's new integration with a command like `ffmpeg -i movie.mp4 -af whisper=model=large-v3:task=translate output.srt`.
I wonder how the results of an AI Japanese-audio-to-English-subtitles would compare to a fansub-ed anime. I'm guessing it would be a more literal translation vs. contextual or cultural.
Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.
My personnal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also be completely lost when more than one language is used.
It also doesn't understand contexts so does a lot of errors you see in automatic translations from videos in youtube for example.
Hey, indeed Whisper can do the transcription of Japanese and even the translation (but only to English). For the best results you need to use the largest model which depending on your hardware might be slow or fast.
Another option is to use something like VideoToTextAI which allows you to transcribe it fast and then translate it into 100+ languages which you can then export the subtitle (SRT) file for
Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words.
I wish they worked with the mpv folks instead of shoehorning this in. Based on the docs it looks like getting live transcription for a video will involve running the demuxer/decoder on one thread, and this whisper filter on another thread, using ffmpeg's AVIO (or to a REST API [1].... shudders) to synchronize those two parallel jobs. It could have been way simpler.
Other than for the "live transcription" usecase (that they made unnecessarily complicated), I don't see how this is any better than running Whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2] but LLMs make that point moot since you can just ask them to do the drudgery for you.
I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.
"Making sure you're not a bot!" with no way to get to the actual document that is supposed to be at the URL. Anubis can be configured to be accessible for people without the latest computers by using the meta-refresh proof of work but very few people take any time to configure it and just deploy the defaults. Just like with cloudflare.
That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.
The only problem with this PR/diff is that it creates just a avfilter wrapper around whisper.cpp library and requires the user to manage the dependencies on their own. This is not helpful for novice users who will first need to:
1. git clone whisper.cpp
2. Make sure they have all dependencies for `that` library
3. Hope the build passes
4. Download the actual model
AND only then be able to use `-af "whisper=model...` filter.
If they try to use the filter without all the prereqs they'll fail and it'll create frustration.
It'd be better to natively create a Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually make people use it much more.
While that would be nicer from an end-user perspective, it's something hard to maintain for FFmpeg itself. Consider the velocity of the whisper-cpp project. I'm sure that – just like with filters such as vmaf, which also require building a dependency and downloading a model – precompiled versions will become available for novice users to directly download. Especially considering whisper-cpp is MIT-licensed.
From experience, these bot filters are usually installed because the site would be down entirely without rejecting AI scrapers, so the argument to shut it off to improve usability is rather silly.
They don't need to shut off Anubis, they just need to configure it beyond the defaults. If they turned on the meta-refresh based challenge then all browsers could access it while still keeping most of the bots away. But few people ever configure these things and just accept the broken defaults.
With the current broken default config my browser can't even run the JS challenge due to it using unsupported bleeding edge JS features.
Hi, can you please paste the error message you get? This should be using features that are supported widely as of 2022 and I regularly test on Firefox LTS.
Check out commit 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b from https://git.ffmpeg.org/ffmpeg.git if your web browser doesn't support Javascript. The linked page is just a git viewer for that specific commit.
Took about 30 secs for me (5 yr old intel cpu). Looked like there was a progress bar, but it didn't progress. Maybe the difficulty varies depending on IP address?
I don't think installing (i.e. compiling) whisper.cpp and using it to do audio-to-text is very difficult. If the documentation is too technical I am sure you can ask some LLM to walk you through it. I have used it on Android in termux and on my FreeBSD desktop computer. Would not expect any difficulties on any modern Linux.
Can whisper do multilingual yet? Last time I tried it on some mixed dutch/english text it would spit out english translations for some of the dutch text. Strange bug/feature since from all appearances it had understood the dutch text perfectly fine.
I don't understand how this would happen, though. It's not like it will mishear a dutch sentence as if it's english; it will correctly pick up the dutch sentence, but (since the language is auto-detected as english at the start of the segment), seemingly auto-translate that (correct and correctly heard) dutch text to english. All we need is a way to get the dutch text that's surely somewhere in there, before the translation happens.
Unless it was trained end-to-end on dutch-subtitled english text?? Which might make the translation a somewhat inextricable part of the model..? Does anyone know?
Isn't that a bit much for ASR models? Humans can't handle simultaneous multilingual dictation task either, I have to stop and reinitialize ears before switching languages between English and my primary one.
In South Asia, it's quite common for people to speak a combination of their local language and English. Not just alternating sentences between the two languages, but in fact, constructing sentences using compound phrases from the two languages.
"Madam, please believe me, maine homework kiya ha" [I did my homework].
I found that it works quite well for Dutch+English as long as you use one of the larger models. But that may just be luck, I imagine mixing Italian and Swedish will have very different results.
I know it is ostensibly multilingual, it's less than a year since I tried, but it does this thing where it then translates everything (or only some things) into a single language regardless with no way to turn it off.
If set, the transcription output will be sent to the specified file or URL
(use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages.
The output will also be set in the "lavfi.whisper.text" frame metadata.
If the destination is a file and it already exists, it will be overwritten.
@item format
The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json".
Default value: @code{"text"}
I don't know if this can embed the subtitles, but it does support generating accompanying srt files.
Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.
In my experience, a small/tiny whisper model has pretty okay English decoding speed on something relatively modern even without GPU support. There's a bunch of latency in the process (because of technological limitations) but the optimised C++ version shouldn't pose too much of a problem unless you're running in power saving mode. Battery life may be a problem on older laptops, though.
I've been waiting a while now for automatic translated subtitles in vlc. I thought it would be here by now. I'm probably underestimating the difficulty but I'm surprised some video player hasn't done it by now. (as far as I know).
I've been playing with whisper to try to do local transcription of long videos, but one issue I've found is that long (>15 seconds) spans without any speech tend to send it into a hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they will be able to configure it in a way that can improve that situation.
Whisper is supposed to be used with voice activity detection and all production implementations that I've seen do that. The raw model is known to make up nonsense for silence because, as I understand it, it was never trained not to do that, assuming everyone will use VAD
I've heard that, but that doesn't sound like a useful approach for videos where (1) non-speech segments can have plenty of other sound (music, noise) and (2) you want timestamps to match up with the original video, like for subtitles. But maybe there are known mitigations for both of those issues that I'm not aware of. And if they do exist maybe they can be included in the ffmpeg whisper integration.
took me longer than i'd care to admit to figure out how to install whisper as a user/system package on macOS w/o brew (which pulls in all of llvm@16 during install)
brew install uv
uv tool install openai-whisper
then add ~/.local/bin/ to $PATH
May I ask, if there is a movie where English people speak English, French people speak French, and German people speak German, is there a software that can generate subtitles in English, French and German without translating anything? I mean, just record what it hears.
I tried to use whisper to generate non-english subs from english audio, but wasnt able to figure out. I know it can do english subs from non-english audio, and that earlier (less precise) versions could do any language audio -> any language subs, but latest whisper only to english subs.
I solved it by generating English subtitles, then passing those to an LLM in chunks that are ~20 entries in size. Include preceding and following subtitles as context for better translation. Make sure to replace the timestamps with simple integer ids, because LLMs like to mangle those, no matter how hard you prompt.
I could share a python script that is working pretty reliably for me.
I was expecting a lot more comments on if this is a necessary feature or if this even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawless, whisper is very limited.
How could one in theory, use this to train on a new language?
Say for a hubby project; I have recordings of some old folks stories in my local dialect.
as someone who has a live application using whisper and ffmpeg, this does seem like just feature creep. ffmpeg and whisper both are otherwise well limited CLI tools adhering to the unix philosophy, this ... idk
There are many streaming ASR models based on CTC or RNNT. Look for example at sherpa (https://github.com/k2-fsa/sherpa-onnx), which can run streaming ASR, VAD, diarization, and many more.
At least whisper.cpp only supports a few input formats like WAV and MP3. To get subtitles for videos I always have to first run ffmpeg to get an audio file and then run whisper.cpp. Guess this new feature may mean that I can do it in just one step, so slightly more convenient?
I see, thanks. I actually do almost all my Whisper work with ogg files, and got into a snag recently with m4a files. Transcoding to an equivalent size ogg or mp3 killed the quality, and wav is too big. Maybe FFmpeg could be of service here.
I run a service that does transcriptions as part of the pipeline, and I use ffmpeg for other parts (such as speeding up audio). Having it all on a single command might make sense for some people if the costs work out.
It’s fairly easy to get diarizarion working with pyannote.audio and https://huggingface.co/pyannote/speaker-diarization-3.1 with ffmpeg converting the audio first to 16kHz mono WAV file but it really depends a lot on the audio - two person podcast where the speakers allow each other space works but lots of people with overlapping voices on the audio - not so great
Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.
People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.
HOWTO:
Drop a video or audio file to the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is for cleaning up imperfect transcripts with features like Tools > Fix common errors.
EDIT: Oh, and if you're on the current gen of Nvidia card, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file, output or something like that.
EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):
If you get the errors and the above fixes work, please type your error message in a reply with what worked to help those who come after. Or at least the web crawlers for those searching for help.
https://www.nikse.dk/subtitleedit
https://www.nikse.dk/donate
https://github.com/SubtitleEdit/subtitleedit/releases
> uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
uv has a feature to get the correct version of torch based on your available cuda (and some non-cuda) drivers (though I suggest using a venv not the system Python):
> uv pip install torch torchvision torchaudio --torch-backend=auto
More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...
This also means you can safely mix torch requirements with non-torch requirements as it will only pull the torch related things from the torch index and everything else from PyPI.
I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with python. That's an incredible feat.
But, when I hear about these kinds of extras, it makes me even more excited. Getting cuda and torch to work together is something I have struggled countless times.
The team at Astral should be nominated for a Nobel Peace Prize.
3 replies →
Of all the great things people say about UV, this is the one that sold me on it when I found this option in the docs. Such a nice feature.
Aegisub is still actively developed (forked), and imo, both software can't really be compared to one another. They can complement each other, but SE is much better for actual transcription. Aegisub still does the heavy lifting for typesetting and the like.
whisper is definitely nice, but it's a bit too slow. Having subtitles and transcription for everything is great - but Nemo Parakeet (pretty much whisper by nvidia) completely changed how I interact with the computer.
It enables dictation that actually works and it's as fast as you can think. I also have a set of scripts which just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice with F5-TTS back and it's like having a local Jarvis.
The main limitation is being english only.
Would you share the scripts?
1 reply →
Yeah, mind sharing any of the scripts? I looked at the docs briefly, looks like we need to install ALL of nemo to get access to Parakeet? Seems ultra heavy.
2 replies →
Can you give an example why it made your life that much better?
I used it like sibling commenter to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better that YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.
I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as good as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic and sometimes now I read an episode instead of listen, or listen to it with mpv with subtitles.
2 replies →
Aside from accessibility as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching on 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay it.
But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining to hear when there's captions for it.
Obviously one important factor to the convenience is how fast your computer is at transcription or translation. I don't use the features in real-time personally currently, although I'd like to if a great UX comes along through other software.
There's also a great podcast app opportunity here I hope someone seizes.
As a hard of hearing person, I can now download any video from the internet (e.g. youtube) and generate subtitles on the fly, not having to struggle to understand badly recorded or unintelligible speech.
13 replies →
I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.
10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to, in order to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.
You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic difference between langauges. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask about why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.
Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.
2 replies →
whisper is great, i wonder why youtube's auto generated subs are still so bad? even the smallest whisper is way better than google's solution? is it licensing issue? harder to deploy at scale?
I believe youtube still uses 40 mel-scale vectors as feature data, whisper uses 80 (which provides finer spectral detail but is computationally more intensive to process naturally, but modern hardware allows for that)
You’d think they’d use the better model for at least videos that have a large view counts (they already do that when deciding compression optimizations).
Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers I built a simple to use and affordable API that you can use: https://lemonfox.ai/
Kdeenlive also supports auto-generating subtitles which need some editing, but it is faster than create them from scratch. Actually I would be happy even with a simple voice detector so that I don't have to set the timings manually.
Subtitle edit is great, and their subtitle library libse was exactly what I needed for a project I did.
I found this online demo of it: https://www.nikse.dk/subtitleedit/online
You don't happen to know a whisper solution that combines diarization with live audio transcription, do you?
Check out https://github.com/jhj0517/Whisper-WebUI
I ran it last night using docker and it worked extremely well. You need a HuggingFace read-only API token for the Diarization. I found that the web UI ignored the token, but worked fine when I added it to docker compose as an environment variable.
WhipserX's diarization is great imo:
Works a treat for Zoom interviews. Diarization is sometimes a bit off, but generally its correct.
1 reply →
Proper diarization still remains a white whale for me, unfortunately.
Last I looked into it, the main options required API access to external services, which put me off. I think it was pyannotate.audio[1].
[1]: https://github.com/pyannote/pyannote-audio
1 reply →
Is there a way to use it to generate a srt subtitle file given a video file?
It generates a few formats by default including srt
you can install suing winget or chocolately
Once local transcription is in more places hopefully we can persuade content creator not to burn bouncing sub-titles into their videos.
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting sub-titles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also local transcription allows for automatic translation and again overlaying subtitles on top of an existing burnt in set is a really poor reading experience.
They do that because it increases “engagement”, not because they care about the user’s experience with the subtitles.
Also some social media platforms don't offer subtitle functionality, so burned-in is the only way if you want to serve your content to people that require subtitles or refuse to unmute their phones while they watch from their toilet.
I did that (distracting subtitles) on one of my videos and it had a very negative response. I won't do it again, but I was puzzled because I find it much nicer than the traditional subtitle format personally. It's easier for my brain to focus on. (And no one in my test audience minded.)
2 replies →
Those burned in subtitles still aren’t as cool as theme-matched anime subtitles during intro music sequences from fansubs 15 years ago.
Those are still cool IMO
Or how the fansubbers will create masks to translate diegetic text like signage and written notes
1 reply →
I recently discovered that the Internet Archive has the Tomodachi fansubs of Fushigi Yugi which, at least in my experience, were the most famous example of that technique.
https://archive.org/details/tomodachi-fushigi-yugi-vhsrip
Algorithm boosts it that’s why they do it. Even if every device had real time 100% accurate subtitling built in they’d still do it if they video performs better with it.
I think this trend is partially driven by the silent auto play that happens on YouTube. Baked in subtitles help draw people into the video.
The other problem with burned-in subtitles is you can't change the language.
The other other problem with burned-in subtitles is that they normally have horrible formatting. Who wants to try to read single words that only flash on-screen while they are being spoken?
True, but (as someone who not infrequently has to rewind content on just about all streaming apps because it decided one particular subtitle only needed to be display for less than 200ms this time around) sometimes burned-in seems like a good idea.
I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.
4 replies →
They could also just upload those transcriptions as normal closed-captioning srt subtitles...
not all social media will show subtitles/captions tho, which is the challenge. YouTube Shorts, TikTok videos, IG reels, FB reels, Whatsapp statuses, and more. I think some allow cc but some don't, and if someone reshares to another platform, it may not be there, so some of us burn them in begrudgingly :-)
It's just so annyoing how someone like Netflix offers like 3-4 languages for most of its content when you can basically get it for free via browser extensions (if you watch on browser).
Must be union thing.
That Netflix who would need to pay more to license more subtitles can't compete with pirated or unlicensed auto-generated subtitles shouldn't really be a surprise.
It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.
1 reply →
[dead]
Does this have the ability to edit historic words as more info becomes available?
Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to have both low latency and high accuracy, and things like transcription on android do that and you can see the adjusting guesses as you talk.
A good opportunity to point people to the paper with my favorite title of all time:
"How to wreck a nice beach you sing calm incense"
https://dl.acm.org/doi/10.1145/1040830.1040898
For folks like me puzzling over what the correct transcription of the title should be, I think it's "How to recognize speech using common sense"
17 replies →
The paper: https://sci-hub.st/https://dl.acm.org/doi/10.1145/1040830.10...
(Agree that the title is awesome, by the way!)
Direct PDF download link:
https://web.media.mit.edu/~lieber/Publications/Wreck-a-Nice-...
Fun fact, I just could not work out what this was supposed to be, so I just used Whisper (indirectly, via the FUTO Voice Input app on my phone) and repeated the sentence into it, and it came out with the 'correct' transcription of "How to recognize speech using common sense." first time.
Of course, this is nothing like what I actually said, so... make your own mind up whether that is actually a correct transcription or not!
I have a British accent, for the record.
My favorite is:
"Threesomes, with and without blame"
https://dl.acm.org/doi/10.1145/1570506.1570511
(From a professor I worked with a bit in grad school)
Also relevant: The Two Ronnies - "Four Candles"
https://www.youtube.com/watch?v=gi_6SaqVQSw
Do AI voice recognition still use markov models for this?
1 reply →
This is what your brain does when it processes language.
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
A slight segue to this; I was made aware of the phenomena that - The language in which you think in, sets the constraints to which you level of expanse the brain can think and parse information in.
I think in English fortunately and it's an ever evolving language so, expanding as the world does. That is compared to the majority of people where I'm from; English was a second language they had to learn and the people that thought them weren't well equipped with the resources to do a good job.
│
└── Dey well; Be well
6 replies →
It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.
Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?
It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomena with their abstractions with their claims of a beauty like music... hmm!
The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.
14 replies →
I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?
3 replies →
Whisper works on 30 second chunks. So yes it can do that and that’s also why it can hallucinate quite a bit.
The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):
9 replies →
Whisper is excellent, but not perfect.
I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."
5 replies →
So, yes, and also no.
I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
I Scream in the Sun https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun
The I is emphasized more in I scream than ice cream I think.
But it’s great point that you need context to be sure.
what would it make of this? https://www.youtube.com/watch?v=zyvZUxnIC3k
Related, a blog article by the author of the patch:
Run Whisper audio transcriptions with one FFmpeg command
https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254
Link is broken, full link: https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
Correct URL: https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
I wonder if Apple's upcoming speech APIs can be added too. Would be cool to have it just work out of the box on Macs, without needing to source a model.
https://developer.apple.com/documentation/speech/speechtrans...
https://developer.apple.com/documentation/speech/speechanaly...
https://www.macstories.net/stories/hands-on-how-apples-new-s...
Am I correct in understanding that Whisper is a speech recognition AI model originally created by OpenAI?
https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
yep, there's a c++ implementation to run it https://github.com/ggml-org/whisper.cpp
Isn't WhisperX the canonical choice for running Whisper?
2 replies →
Yes.
From the documentation:
> It runs automatic speech recognition using the OpenAI's Whisper model.
Thanks, I was being tripped up by DDOS protection on code.ffmpeg.org for a minute and couldn't read the patch. The combo of Firefox and the fact that Quantum/Lumen/CenturyLink seems to get off by rotating my dynamic IP for no reason occasionally triggers various DDOS protections schemes.
1 reply →
Kind of, it's a family of audio transcription models.
https://huggingface.co/search/full-text?q=whisper
I think so, if I remember correctly PotPlayer also supports it for automatic subtitling.
Yes, according to the comments in the patch, you are correct.
yes.
I hope this is the start of more ML filters in ffmpeg. They added the sr (super resolution) filter years ago, but it's old and it's difficult to get the weights so you can run it, since they're not included. They have added support for multiple inference libraries like libtorch, but again, it's difficult to even get started. Hopefully they can get behind a consistent ML strategy, ideally with a "models" directory with ready to use models for upscaling, temporal upscaling, noise cancelling, etc. A lot of audio and video filter research use ML now, new codecs will probably also use it soon.
I had a small bash pipeline for doing this until now.
The reading from mic part (-f pulse, pactl...) is linux-specific rest of it should be cross platform. The `main` executable is the whisper.cpp executable (see whisper.cpp github readme, it's just the output of `make base.en` from that).
Edit: -t 5 controls recording duration.
Oh and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM that then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc) which I then....
Yes, please, go on...
The LLM turns my unstructured command into structured command (a limited set of commands hardcoded in the prompt) and a script takes that and executes it. I have it do stuff like interact with google keep/google calendar using the CLI. Those are the most used actions but there's a few others . Of course all actions can be scheduled.
The LLM can screw up now and then and output absolute garbage. But I've got a knack now for figuring out what prompts it's gonna be hopeless on and I manually enter those.
Example:
Saying
Remove makhana from shopping list
Ends up running the command
gkeep items edit shopping_list --check makhana
There is a direct text interface too that skips the voice transcription.
The main thing is it does in a background window without interrupting my screen or me needing to wait for whatever slow webpage to load. I had it do a few things on GitHub like remind me when checks pass on PRs. You could potentially connect it to various things like your amazon account to check on your order, etc,.. as I write this I now realise I did what basically amounts to what folks do with MCP today. Maybe I should update it to use the protocol.
These days I have a little more idle time as a grad student than I did in a tech company, and I don't really need to manage home/cooking/... so I don't really use some of the more complicated features. I mostly just use it to schedule 1on1s with my guide and add reminders about assignments and TA work and talks and my music class.
2 replies →
I know nothing about Whisper, is this usable for automated translation?
I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.
A couple years ago I had been negotiating with a guy on Fiver to translate them. At his usual rate-per-minute of footage it would have cost thousands of dollars but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll need the "large-v3" model for best results, and you can use ffmpeg's new integration with a command like `ffmpeg -i movie.mp4 -af whisper=model=large-v3:task=translate output.srt`.
I wonder how the results of an AI Japanese-audio-to-English-subtitles would compare to a fansub-ed anime. I'm guessing it would be a more literal translation vs. contextual or cultural.
I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flare. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma...
Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.
2 replies →
In my experience it works ok. The "English" model actually knows a lot of languages and will translate directly to English.
You can also transcribe it to Japanese and use a translator to convert to English. This can sometimes help for more semantically complex dialogue.
For example, using faster-whisper-xxl [1]:
Direct translation:
Use Japanese, then translate:
1. https://github.com/Purfview/whisper-standalone-win
My personnal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also be completely lost when more than one language is used.
It also doesn't understand contexts so does a lot of errors you see in automatic translations from videos in youtube for example.
It's curious how YouTube's is so bad still given the current state of the art; but it has got a lot better in the last 6 months.
Whisper has quite bad issues with hallucination. It will inject sentences that were never said in the audio.
It's decent for classification but poor at transcription.
Pre-processing with a vocal extraction model (bs-rofomer or similar) helps a lot with the hallucinations, especially with poor quality sources.
1 reply →
Hey, indeed Whisper can do the transcription of Japanese and even the translation (but only to English). For the best results you need to use the largest model which depending on your hardware might be slow or fast.
Another option is to use something like VideoToTextAI which allows you to transcribe it fast and then translate it into 100+ languages which you can then export the subtitle (SRT) file for
Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words.
May I ask which movies? I'm just curious
I wish they worked with the mpv folks instead of shoehorning this in. Based on the docs it looks like getting live transcription for a video will involve running the demuxer/decoder on one thread, and this whisper filter on another thread, using ffmpeg's AVIO (or to a REST API [1].... shudders) to synchronize those two parallel jobs. It could have been way simpler.
Other than for the "live transcription" usecase (that they made unnecessarily complicated), I don't see how this is any better than running Whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2] but LLMs make that point moot since you can just ask them to do the drudgery for you.
[1] https://news.ycombinator.com/item?id=44890067
I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.
Is this website open? Would love to see your work :P
somerville.votolab.com
2 replies →
I wanted to do this for my local county council meetings. I think in this context speaker recognition would be important.
"Making sure you're not a bot!" with no way to get to the actual document that is supposed to be at the URL. Anubis can be configured to be accessible for people without the latest computers by using the meta-refresh proof of work but very few people take any time to configure it and just deploy the defaults. Just like with cloudflare.
That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.
The only problem with this PR/diff is that it creates just a avfilter wrapper around whisper.cpp library and requires the user to manage the dependencies on their own. This is not helpful for novice users who will first need to:
1. git clone whisper.cpp
2. Make sure they have all dependencies for `that` library
3. Hope the build passes
4. Download the actual model
AND only then be able to use `-af "whisper=model...` filter.
If they try to use the filter without all the prereqs they'll fail and it'll create frustration.
It'd be better to natively create a Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually make people use it much more.
While that would be nicer from an end-user perspective, it's something hard to maintain for FFmpeg itself. Consider the velocity of the whisper-cpp project. I'm sure that – just like with filters such as vmaf, which also require building a dependency and downloading a model – precompiled versions will become available for novice users to directly download. Especially considering whisper-cpp is MIT-licensed.
Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc
If they want to support it out-of-the box, they'll still have to embed a model file (roughly 500 MB - 3GB, varying size and quality)
Can't you point ffmpeg to a model file using some preferences dialog?
Shut off the broken bot filter so we can read it please
From experience, these bot filters are usually installed because the site would be down entirely without rejecting AI scrapers, so the argument to shut it off to improve usability is rather silly.
They don't need to shut off Anubis, they just need to configure it beyond the defaults. If they turned on the meta-refresh based challenge then all browsers could access it while still keeping most of the bots away. But few people ever configure these things and just accept the broken defaults.
With the current broken default config my browser can't even run the JS challenge due to it using unsupported bleeding edge JS features.
Hi, can you please paste the error message you get? This should be using features that are supported widely as of 2022 and I regularly test on Firefox LTS.
2 replies →
Archived snapshots of the linked page:
https://web.archive.org/web/20250813104007/https://code.ffmp...
https://archive.is/dmj17
You can read it on one of these without having to pass that specific bot check
Check out commit 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b from https://git.ffmpeg.org/ffmpeg.git if your web browser doesn't support Javascript. The linked page is just a git viewer for that specific commit.
Or read the documentation for the new whisper filter: https://ffmpeg.org/ffmpeg-filters.html#whisper-1
2 replies →
Took my iPhone 12 Mini a whole of 0.1 seconds to pass it. What hardware/OS are you using?
Took me zero seconds to be blocked with invalid response
3 replies →
Took about 30 secs for me (5 yr old intel cpu). Looked like there was a progress bar, but it didn't progress. Maybe the difficulty varies depending on IP address?
3 replies →
The stock chrome browser Google news uses
Took me 8 seconds on my shitty desktop.
Annoyingly, something is broken with their anti not stuff, as it keeps refusing to let me see the page.
I wonder if they'll be satisfied there or add a chunk of others now that they've started. Parakeet is supposed to be good?
Should they add Voice Activity Detection? Are these separate filters or just making the whisper filter more fancy?
Voice Activity Detection support is already included.
Parakeet is indeed really awesome.
Not sure it will be packaged in Debian, with an external binary model god knows how it was produced...
It looks like the model file needs to be supplied at invocation time, so the binary blob would not be required for packaging.
so 'apt install ffmpeg' won't be enough to have the feature?
1 reply →
on an aside, my favorite whisper 'hack' is you can just speed up audio 10x to process it 10x faster, then adjust the timings after
This would be really cool to use to implement live transcripts
Is Whisper still SOTA 3 years later? It does not seem there is a clearly better open model. Alec Radford really is a genius!
Looks like there's a leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
NVIDIA Nemo Parakeet for English. Mistral’s recent Voxtral is supposed to be nice and open source
3 years later and Youtube CCs is still horrible lol
How can I run Whisper or this software in Linux or Android as a non-technical user?
Basically a simple audio-to-text for personal use?
https://handy.computer "the free and open source app for speech to text" is available for Linux (and Windows/Mac).
I don't think installing (i.e. compiling) whisper.cpp and using it to do audio-to-text is very difficult. If the documentation is too technical I am sure you can ask some LLM to walk you through it. I have used it on Android in termux and on my FreeBSD desktop computer. Would not expect any difficulties on any modern Linux.
Can whisper do multilingual yet? Last time I tried it on some mixed dutch/english text it would spit out english translations for some of the dutch text. Strange bug/feature since from all appearances it had understood the dutch text perfectly fine.
I think the Dutch/English is probably the worst combination for this. Languages are rather close.
I don't understand how this would happen, though. It's not like it will mishear a dutch sentence as if it's english; it will correctly pick up the dutch sentence, but (since the language is auto-detected as english at the start of the segment), seemingly auto-translate that (correct and correctly heard) dutch text to english. All we need is a way to get the dutch text that's surely somewhere in there, before the translation happens.
Unless it was trained end-to-end on dutch-subtitled english text?? Which might make the translation a somewhat inextricable part of the model..? Does anyone know?
2 replies →
Isn't that a bit much for ASR models? Humans can't handle simultaneous multilingual dictation task either, I have to stop and reinitialize ears before switching languages between English and my primary one.
In South Asia, it's quite common for people to speak a combination of their local language and English. Not just alternating sentences between the two languages, but in fact, constructing sentences using compound phrases from the two languages.
"Madam, please believe me, maine homework kiya ha" [I did my homework].
2 replies →
Seems like it already has the capability somewhere in the model though - see my reply to clarionbell.
Isn't that exactly what intepreters do?
1 reply →
I found that it works quite well for Dutch+English as long as you use one of the larger models. But that may just be luck, I imagine mixing Italian and Swedish will have very different results.
Best for English, but I've found it pretty decent for Spanish.
It's even better for some languages other than English (e. g. Spanish), see: https://github.com/openai/whisper?tab=readme-ov-file#availab...
Whisper has been multilingual for 5 years at least.
I know it is ostensibly multilingual, it's less than a year since I tried, but it does this thing where it then translates everything (or only some things) into a single language regardless with no way to turn it off.
1 reply →
Except it’s only been released in September 2022 (not even 3 years ago).
Whisper-v3 works well for multi-lingual. I tried it with Dutch, German and English
Does this finally enable dynamically generating subtitles for movies with AI?
Docs say:
I don't know if this can embed the subtitles, but it does support generating accompanying srt files.
Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.
If you have enough processing power. Without a GPU it's going to lag.
In my experience, a small/tiny whisper model has pretty okay English decoding speed on something relatively modern even without GPU support. There's a bunch of latency in the process (because of technological limitations) but the optimised C++ version shouldn't pose too much of a problem unless you're running in power saving mode. Battery life may be a problem on older laptops, though.
Whisper is pretty fast.
Finally? I think VLC demo'd this a while ago at some conference where they had a table, if I remember correctly.
VLC and ffmpeg are unrelated projects
3 replies →
I've been waiting a while now for automatic translated subtitles in vlc. I thought it would be here by now. I'm probably underestimating the difficulty but I'm surprised some video player hasn't done it by now. (as far as I know).
1 reply →
I've been playing with whisper to try to do local transcription of long videos, but one issue I've found is that long (>15 seconds) spans without any speech tend to send it into a hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they will be able to configure it in a way that can improve that situation.
Whisper is supposed to be used with voice activity detection and all production implementations that I've seen do that. The raw model is known to make up nonsense for silence because, as I understand it, it was never trained not to do that, assuming everyone will use VAD
You usually delete silence before using something like whisper.
I've heard that, but that doesn't sound like a useful approach for videos where (1) non-speech segments can have plenty of other sound (music, noise) and (2) you want timestamps to match up with the original video, like for subtitles. But maybe there are known mitigations for both of those issues that I'm not aware of. And if they do exist maybe they can be included in the ffmpeg whisper integration.
2 replies →
This is designed for real time use too. And in such cases, you couldn’t delete the silence before use.
1 reply →
I have recently found that parakeet from NVIDIA is way faster and pretty much as correct as Whisper, but it only works with English.
took me longer than i'd care to admit to figure out how to install whisper as a user/system package on macOS w/o brew (which pulls in all of llvm@16 during install)
May I ask, if there is a movie where English people speak English, French people speak French, and German people speak German, is there a software that can generate subtitles in English, French and German without translating anything? I mean, just record what it hears.
[dead]
I tried to use whisper to generate non-english subs from english audio, but wasnt able to figure out. I know it can do english subs from non-english audio, and that earlier (less precise) versions could do any language audio -> any language subs, but latest whisper only to english subs.
Anyone found a way?
I solved it by generating English subtitles, then passing those to an LLM in chunks that are ~20 entries in size. Include preceding and following subtitles as context for better translation. Make sure to replace the timestamps with simple integer ids, because LLMs like to mangle those, no matter how hard you prompt.
I could share a python script that is working pretty reliably for me.
I'd love to see that script, do you have a link?
1 reply →
OH: "New changelog entries go to the bottom, @vpalmisano .. Didn't I tell you this once?"
Fantastic! I am working on a speech-to-text GNOME extension that would immensely benefit from this.
https://github.com/kavehtehrani/gnome-speech2text
Why is this a Gnome extension? I would love to use this in KDE.
Likely because they are a GNOME user and the APIs are DE specific.
I use Ubuntu 24.04 and comes with GNOME Shell.
Did ffmpeg move their bug tracker to Forgejo?
https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
I still see their old one too, but Forgejo one is nice.
I was expecting a lot more comments on if this is a necessary feature or if this even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawless, whisper is very limited.
The only item that was discussed was that the subtitle workflow does not seem to be that good, afaict:
https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomme...
You'd be surprised what's in there, a few forms of NNs are already supported for denoising, speech detection.
I think having this flow out to all of the deps of libav is a greater good than notions of lib purity.
It failed to identify me as a human twice before let me access the page
Is anyone able to get streaming audio to text conversion working with whisper.cpp?
I tried several times to get this into a reasonable shape, but all have been failures. If anyone has pointers I really appreciate it.
Anyone got this to compile on macOS yet? The homebrew binary doesn't yet (and probably won't ever) include the --enable-whisper compile option.
"multi-modal feature extraction → semantic translation → cross-modal feature transfer → precise temporal alignment," is all we need
How could one in theory, use this to train on a new language? Say for a hubby project; I have recordings of some old folks stories in my local dialect.
│
└── Dey well; Be well
https://huggingface.co/blog/fine-tune-whisper
More precisely libavfilter, so it's also soon in mpv and other dependent players.
This is going to be great for real-time audio translation.
as someone who has a live application using whisper and ffmpeg, this does seem like just feature creep. ffmpeg and whisper both are otherwise well limited CLI tools adhering to the unix philosophy, this ... idk
I guess that there is no streaming option for sending generated tokens to, say, an LLM service to process the text in real-time.
Whisper has the encoder-decoder architecture, so it's hard to run streaming efficiently, though whisper-streaming is a thing.
https://kyutai.org/next/stt is natively streaming STT.
There are many streaming ASR models based on CTC or RNNT. Look for example at sherpa (https://github.com/k2-fsa/sherpa-onnx), which can run streaming ASR, VAD, diarization, and many more.
Unrelated, but can I use Whisper in DaVinci resolve to automatically transcribe my videos and add subs?
Unrelated, but why isn’t Europe a country already. It’s been ages!
Aww, I literally just implemented this using whisper.cpp and ffmpeg lib, code is even similar...
Labeling multiple people talking is something i found lacking with whisper, is it better now?
Why would one use FFmpeg with Whisper support, instead of using Whisper directly?
At least whisper.cpp only supports a few input formats like WAV and MP3. To get subtitles for videos I always have to first run ffmpeg to get an audio file and then run whisper.cpp. Guess this new feature may mean that I can do it in just one step, so slightly more convenient?
I see, thanks. I actually do almost all my Whisper work with ogg files, and got into a snag recently with m4a files. Transcoding to an equivalent size ogg or mp3 killed the quality, and wav is too big. Maybe FFmpeg could be of service here.
I run a service that does transcriptions as part of the pipeline, and I use ffmpeg for other parts (such as speeding up audio). Having it all on a single command might make sense for some people if the costs work out.
Terrific, thank you.
What's the benefit VS using whisper as a separate tool?
That's great. How does Whisper compare to Google Gemini's transcription capabilities?
Can't view site. Some sort of misconfigured CAPTCHA bullshit.
Does this whisper also do text-to-speech?
No
Now if it only did separate speaker identification (diarization)
It’s fairly easy to get diarizarion working with pyannote.audio and https://huggingface.co/pyannote/speaker-diarization-3.1 with ffmpeg converting the audio first to 16kHz mono WAV file but it really depends a lot on the audio - two person podcast where the speakers allow each other space works but lots of people with overlapping voices on the audio - not so great
hell yeah
[dead]
[dead]
[dead]
Very interesting to see this!
[dead]