Comment by kmfrk

10 days ago

Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has turned my life upside-down in an unambiguously good way.

People should check out Subtitle Edit (and throw the dev some money), which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old like me.

HOWTO:

Drop a video or audio file into the right-hand window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is built for cleaning up imperfect transcripts, with features like Tools > Fix common errors.

EDIT: Oh, and if you're on the current gen of Nvidia cards, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error mentions an empty file or empty output, something like that.
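
If you'd rather script the same thing outside the GUI, the faster-whisper Python library (which, as far as I know, the Faster-Whisper engines in Subtitle Edit are built on) exposes the same knobs. A minimal sketch - the file name is just a placeholder, and "float32" is the workaround from the EDIT above:

    from faster_whisper import WhisperModel

    # large-v2 per the recommendation above; switch compute_type to "float32"
    # if float16 gives you the empty-output error on newer Nvidia cards
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    segments, info = model.transcribe("input.mp3", task="transcribe")
    print(f"Detected language: {info.language}")
    for seg in segments:
        print(f"[{seg.start:8.2f} -> {seg.end:8.2f}] {seg.text}")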

EDIT2: And if you get another error, possibly about whisper.exe, IIRC I had to reinstall the Torch libs from a specific index with something along these lines (depending on whether you use pip or uv):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you hit these errors and the above fixes work, please post your error message in a reply along with what worked, to help those who come after. Or at least the web crawlers, for anyone searching for help.

https://www.nikse.dk/subtitleedit

https://www.nikse.dk/donate

https://github.com/SubtitleEdit/subtitleedit/releases

> uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

uv has a feature to get the correct version of torch based on your available CUDA (and some non-CUDA) drivers (though I suggest using a venv, not the system Python):

> uv pip install torch torchvision torchaudio --torch-backend=auto

More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...

This also means you can safely mix torch requirements with non-torch requirements, as it will only pull the torch-related things from the torch index and everything else from PyPI.

  • I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with python. That's an incredible feat.

    But, when I hear about these kinds of extras, it makes me even more excited. Getting CUDA and torch to work together is something I have struggled with countless times.

    The team at Astral should be nominated for a Nobel Peace Prize.

    • > "uv add"

      One life-changing thing I've been using `uv` for:

      System python version is 3.12:

          $ python3 --version
          Python 3.12.3
      

      A script that requires a library we don't have, and won't work on our local python:

          $ cat test.py
          #!/usr/bin/env python3
      
          import sys
          from rich import print
      
          if sys.version_info < (3, 13):
              print("This script will not work on Python 3.12")
          else:
              print(f"Hello world, this is python {sys.version}")
      

      It fails:

          $ python3 test.py
          Traceback (most recent call last):
          File "/tmp/tmp/test.py", line 10, in <module>
              from rich import print
          ModuleNotFoundError: No module named 'rich'
      

      Tell `uv` what our requirements are:

          $ uv add --script=test.py --python '3.13' rich
          Updated `test.py`
      

      `uv` updates the script:

          $ cat test.py
          #!/usr/bin/env python3
          # /// script
          # requires-python = ">=3.13"
          # dependencies = [
          #     "rich",
          # ]
          # ///
      
          import sys
          from rich import print
      
          if sys.version_info < (3, 13):
              print("This script will not work on Python 3.12")
          else:
              print(f"Hello world, this is python {sys.version}")
      

      `uv` runs the script, after installing packages and fetching Python 3.13:

          $ uv run test.py
          Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
          Downloading cpython-3.13.5-linux-x86_64-gnu (download)
          Installed 4 packages in 7ms
          Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
      

      And if we run it with Python 3.12, we can see the warning:

          $ uv run --python 3.12 test.py
          warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
          Installed 4 packages in 7ms
          This script will not work on Python 3.12
      

      Works for any Python you're likely to want:

          $ uv python list
          cpython-3.14.0b2-linux-x86_64-gnu                 <download available>
          cpython-3.14.0b2+freethreaded-linux-x86_64-gnu    <download available>
          cpython-3.13.5-linux-x86_64-gnu                   /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
          cpython-3.13.5+freethreaded-linux-x86_64-gnu      <download available>
          cpython-3.12.11-linux-x86_64-gnu                  <download available>
          cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3.12
          cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3 -> python3.12
          cpython-3.11.13-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
          cpython-3.10.18-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
          cpython-3.9.23-linux-x86_64-gnu                   <download available>
          cpython-3.8.20-linux-x86_64-gnu                   <download available>
          pypy-3.11.11-linux-x86_64-gnu                     <download available>
          pypy-3.10.16-linux-x86_64-gnu                     <download available>
          pypy-3.9.19-linux-x86_64-gnu                      <download available>
          pypy-3.8.16-linux-x86_64-gnu                      <download available>
          graalpy-3.11.0-linux-x86_64-gnu                   <download available>
          graalpy-3.10.0-linux-x86_64-gnu                   <download available>
          graalpy-3.8.5-linux-x86_64-gnu                    <download available>

    • Agreed, making virtual environment management and so much else disappear lets so much more focus go to Python itself.

  • Of all the great things people say about UV, this is the one that sold me on it when I found this option in the docs. Such a nice feature.

Aegisub is still actively developed (as a fork), and imo the two can't really be compared to one another. They complement each other: SE is much better for actual transcription, while Aegisub still does the heavy lifting for typesetting and the like.

whisper is definitely nice, but it's a bit too slow. Having subtitles and transcription for everything is great - but Nemo Parakeet (pretty much whisper by nvidia) completely changed how I interact with the computer.

It enables dictation that actually works, and it's as fast as you can think. I also have a set of scripts that just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice response back with F5-TTS, and it's like having a local Jarvis.

The main limitation is that it's English-only.
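
The dispatch side of those scripts doesn't need to be anything fancy. A rough sketch of the idea - the phrases and commands here are made up for illustration, and the transcript is whatever string Parakeet hands you:

    import subprocess

    # example phrase -> command mapping (made up for illustration)
    COMMANDS = {
        "open the browser": ["firefox"],
        "lock the screen": ["loginctl", "lock-session"],
    }

    def handle(transcript: str) -> None:
        text = transcript.lower().strip()
        for phrase, cmd in COMMANDS.items():
            if phrase in text:
                subprocess.run(cmd)
                return
        # anything that isn't a known command can go to the LLM / F5-TTS side instead
        print(f"no command matched: {text!r}")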

  • Yeah, mind sharing any of the scripts? I looked at the docs briefly; it looks like we need to install ALL of NeMo to get access to Parakeet? Seems ultra heavy.

    • You only need the ASR bits -- this is where I got to when I previously looked into running Parakeet:

          # NeMo does not run on 3.13+
          python3.12 -m venv .venv
          source .venv/bin/activate
      
          git clone https://github.com/NVIDIA/NeMo.git nemo
          cd nemo
      
          pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
          pip install .[asr]
      
          deactivate
      

      Then run a transcribe.py script in that venv:

          import sys
          import nemo.collections.asr as nemo_asr

          model_path = sys.argv[1]
          audio_path = sys.argv[2]

          # Load from a local .nemo file...
          asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(restore_path=model_path)

          # ...or, instead, download from huggingface ('org/model'):
          # asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_path)

          output = asr_model.transcribe([audio_path])
          print(output[0])
      

      With that I was able to run the model, but I ran out of memory on my lower-spec laptop. I haven't yet got around to running it on my workstation.

      You'll need to modify the python script to process the response and output it in a format you can use.
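
      For example, something like this to dump the text next to the audio file (depending on the NeMo version, transcribe() returns plain strings or Hypothesis objects, so this handles both):

          from pathlib import Path

          def save_transcript(audio_path, output):
              first = output[0]
              # newer NeMo versions return Hypothesis objects with a .text attribute,
              # older ones return plain strings
              text = getattr(first, "text", first)
              out_file = Path(audio_path).with_suffix(".txt")
              out_file.write_text(str(text) + "\n", encoding="utf-8")
              return out_file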

Can you give an example of why it made your life that much better?

  • I used it like the sibling commenter to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better than YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.

    I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv and subtitles.

    • This, but I want a summary of the 3-hour video before spending the time on it.

      Download -> generate subtitles -> feed to AI for summary works pretty well
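
      A rough sketch of that last step, assuming you already have the SRT and an OpenAI-compatible endpoint (the model name and prompt are just placeholders, and a 3-hour transcript may need chunking to fit the context window):

          from pathlib import Path
          from openai import OpenAI  # any OpenAI-compatible client/endpoint works

          srt_text = Path("video.srt").read_text(encoding="utf-8")

          client = OpenAI()  # reads OPENAI_API_KEY from the environment
          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # placeholder model name
              messages=[
                  {"role": "system", "content": "Summarize these subtitles in ~10 bullet points."},
                  {"role": "user", "content": srt_text},
              ],
          )
          print(resp.choices[0].message.content)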

  • Aside from accessibility, as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching at 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay them.

    But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less of a strain when there are captions for it.

    Obviously one important factor in the convenience is how fast your computer is at transcription or translation. Personally I don't currently use the features in real time, although I'd like to if a great UX comes along in other software.

    There's also a great podcast app opportunity here I hope someone seizes.

  • As a hard-of-hearing person, I can now download any video from the internet (e.g. YouTube) and generate subtitles on the fly, without having to struggle to understand badly recorded or unintelligible speech.

    • I did this as recently as today, for that reason, using ffmpeg and whisper.cpp. But not on the fly. I ran it on a few videos to generate VTT files.
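
      In case it helps anyone, the rough shape of that pipeline as a script - the paths and model are placeholders, and the whisper.cpp binary name and flags vary between versions ("main" in older builds, "whisper-cli" in newer ones):

          import subprocess
          from pathlib import Path

          video = Path("talk.mp4")          # placeholder input file
          wav = video.with_suffix(".wav")

          # whisper.cpp expects 16 kHz mono PCM audio, so convert first
          subprocess.run(["ffmpeg", "-y", "-i", str(video),
                          "-ar", "16000", "-ac", "1", str(wav)], check=True)

          # -ovtt writes a .vtt next to the output basename given with -of
          subprocess.run(["./main", "-m", "models/ggml-medium.bin",
                          "-f", str(wav), "-ovtt", "-of", str(video.with_suffix(""))],
                         check=True)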

  • I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.

    10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or take older lecture videos that don't have transcripts. Many courses had to provide them to comply with federal funding, but not all, and lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German and Swiss institutions). Also think about taking this auto-generated output and then generating summaries for lecture notes or reading recommendations - this sort of stuff is what LLMs are great at.

    You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask about why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.

    Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.

whisper is great, i wonder why youtube's auto-generated subs are still so bad? even the smallest whisper is way better than google's solution. is it a licensing issue? harder to deploy at scale?

  • I believe youtube still uses 40 mel-scale features as input, while whisper uses 80 (which provides finer spectral detail but is naturally more computationally intensive to process; modern hardware allows for that).
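
    For anyone curious, here's roughly what that difference looks like with librosa (the file name is a placeholder, and Whisper's real front end lives in its own code; this just illustrates the n_mels knob):

        import librosa
        import numpy as np

        y, sr = librosa.load("clip.wav", sr=16000)

        # 80 mel bins (roughly Whisper's front end) vs. a coarser 40-bin one
        mel_80 = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
        mel_40 = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)

        log_mel_80 = np.log10(np.maximum(mel_80, 1e-10))
        print(log_mel_80.shape, mel_40.shape)  # (80, n_frames) vs (40, n_frames)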

  • You’d think they’d use the better model at least for videos with large view counts (they already do that when deciding compression optimizations).

Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers, I built a simple-to-use and affordable API that you can use: https://lemonfox.ai/

Kdenlive also supports auto-generating subtitles; they need some editing, but it's faster than creating them from scratch. Actually, I would be happy even with a simple voice detector so that I don't have to set the timings manually.
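
A crude version of that voice detector is doable with pydub's silence detection (the file name and thresholds are made up, and it finds non-silence rather than strictly speech, but it's enough to rough in timings):

    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent

    audio = AudioSegment.from_file("video.mp4")  # pydub needs ffmpeg for video containers

    # [start_ms, end_ms] spans of audio louder than the threshold
    spans = detect_nonsilent(audio, min_silence_len=500, silence_thresh=audio.dBFS - 16)

    for start_ms, end_ms in spans:
        print(f"{start_ms / 1000:.2f} --> {end_ms / 1000:.2f}")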

Subtitle Edit is great, and its subtitle library libse was exactly what I needed for a project I did.

You don't happen to know of a Whisper solution that combines diarization with live audio transcription, do you?