Comment by kmfrk

10 days ago

Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has turned my life upside-down in an unambiguously good way.

People should check out Subtitle Edit (and throw the dev some money), which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old like me.

HOWTO:

Drop a video or audio file into the right-hand window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is built for cleaning up imperfect transcripts, with features like Tools > Fix common errors.

EDIT: Oh, and if you're on the current gen of Nvidia cards, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error mentions an empty file or empty output, something like that.
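
If you'd rather script the same thing outside the GUI, the faster-whisper Python library (which, as far as I know, the Faster-Whisper engines in Subtitle Edit are built on) exposes the same knobs. A minimal sketch - the file name is just a placeholder, and "float32" is the workaround from the EDIT above:

    from faster_whisper import WhisperModel

    # large-v2 per the recommendation above; switch compute_type to "float32"
    # if float16 gives you the empty-output error on newer Nvidia cards
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    segments, info = model.transcribe("input.mp3", task="transcribe")
    print(f"Detected language: {info.language}")
    for seg in segments:
        print(f"[{seg.start:8.2f} -> {seg.end:8.2f}] {seg.text}")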

EDIT2: And if you get another error, possibly about whisper.exe, IIRC I had to reinstall the Torch libs from a specific index with something along these lines (depending on whether you use pip or uv):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you hit these errors and the above fixes work, please post your error message in a reply along with what worked, to help those who come after. Or at least the web crawlers, for anyone searching for help.

https://www.nikse.dk/subtitleedit

https://www.nikse.dk/donate

https://github.com/SubtitleEdit/subtitleedit/releases

> uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

uv has a feature to get the correct version of torch based on your available CUDA (and some non-CUDA) drivers (though I suggest using a venv, not the system Python):

> uv pip install torch torchvision torchaudio --torch-backend=auto

More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...

This also means you can safely mix torch requirements with non-torch requirements, as it will only pull the torch-related things from the torch index and everything else from PyPI.

  • I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with python. That's an incredible feat.

    But, when I hear about these kinds of extras, it makes me even more excited. Getting CUDA and torch to work together is something I have struggled with countless times.

    The team at Astral should be nominated for a Nobel Peace Prize.

    • > "uv add"

      One life-changing thing I've been using `uv` for:

      System python version is 3.12:

          $ python3 --version
          Python 3.12.3
      

      A script that requires a library we don't have, and won't work on our local python:

          $ cat test.py
          #!/usr/bin/env python3
      
          import sys
          from rich import print
      
          if sys.version_info < (3, 13):
              print("This script will not work on Python 3.12")
          else:
              print(f"Hello world, this is python {sys.version}")
      

      It fails:

          $ python3 test.py
          Traceback (most recent call last):
          File "/tmp/tmp/test.py", line 10, in <module>
              from rich import print
          ModuleNotFoundError: No module named 'rich'
      

      Tell `uv` what our requirements are:

          $ uv add --script=test.py --python '3.13' rich
          Updated `test.py`
      

      `uv` updates the script:

          $ cat test.py
          #!/usr/bin/env python3
          # /// script
          # requires-python = ">=3.13"
          # dependencies = [
          #     "rich",
          # ]
          # ///
      
          import sys
          from rich import print
      
          if sys.version_info < (3, 13):
              print("This script will not work on Python 3.12")
          else:
              print(f"Hello world, this is python {sys.version}")
      

      `uv` runs the script, after installing packages and fetching Python 3.13:

          $ uv run test.py
          Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
          Downloading cpython-3.13.5-linux-x86_64-gnu (download)
          Installed 4 packages in 7ms
          Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
      

      And if we run it with Python 3.12, we can see the warning:

          $ uv run --python 3.12 test.py
          warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
          Installed 4 packages in 7ms
          This script will not work on Python 3.12
      

      Works for any Python you're likely to want:

          $ uv python list
          cpython-3.14.0b2-linux-x86_64-gnu                 <download available>
          cpython-3.14.0b2+freethreaded-linux-x86_64-gnu    <download available>
          cpython-3.13.5-linux-x86_64-gnu                   /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
          cpython-3.13.5+freethreaded-linux-x86_64-gnu      <download available>
          cpython-3.12.11-linux-x86_64-gnu                  <download available>
          cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3.12
          cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3 -> python3.12
          cpython-3.11.13-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
          cpython-3.10.18-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
          cpython-3.9.23-linux-x86_64-gnu                   <download available>
          cpython-3.8.20-linux-x86_64-gnu                   <download available>
          pypy-3.11.11-linux-x86_64-gnu                     <download available>
          pypy-3.10.16-linux-x86_64-gnu                     <download available>
          pypy-3.9.19-linux-x86_64-gnu                      <download available>
          pypy-3.8.16-linux-x86_64-gnu                      <download available>
          graalpy-3.11.0-linux-x86_64-gnu                   <download available>
          graalpy-3.10.0-linux-x86_64-gnu                   <download available>
          graalpy-3.8.5-linux-x86_64-gnu                    <download available>

    • Agreed, making virtual environment management and so much else disappear lets so much more focus go to Python itself.

  • Of all the great things people say about UV, this is the one that sold me on it when I found this option in the docs. Such a nice feature.

Aegisub is still actively developed (as a fork), and imo the two can't really be compared to one another. They complement each other: SE is much better for actual transcription, while Aegisub still does the heavy lifting for typesetting and the like.

whisper is definitely nice, but it's a bit too slow. Having subtitles and transcription for everything is great - but Nemo Parakeet (pretty much whisper by nvidia) completely changed how I interact with the computer.

It enables dictation that actually works, and it's as fast as you can think. I also have a set of scripts that just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice response back with F5-TTS, and it's like having a local Jarvis.

The main limitation is that it's English-only.
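
The dispatch side of those scripts doesn't need to be anything fancy. A rough sketch of the idea - the phrases and commands here are made up for illustration, and the transcript is whatever string Parakeet hands you:

    import subprocess

    # example phrase -> command mapping (made up for illustration)
    COMMANDS = {
        "open the browser": ["firefox"],
        "lock the screen": ["loginctl", "lock-session"],
    }

    def handle(transcript: str) -> None:
        text = transcript.lower().strip()
        for phrase, cmd in COMMANDS.items():
            if phrase in text:
                subprocess.run(cmd)
                return
        # anything that isn't a known command can go to the LLM / F5-TTS side instead
        print(f"no command matched: {text!r}")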

  • Yeah, mind sharing any of the scripts? I looked at the docs briefly; it looks like we need to install ALL of NeMo to get access to Parakeet? Seems ultra heavy.

    • You only need the ASR bits -- this is where I got to when I previously looked into running Parakeet:

          # NeMo does not run on 3.13+
          python3.12 -m venv .venv
          source .venv/bin/activate
      
          git clone https://github.com/NVIDIA/NeMo.git nemo
          cd nemo
      
          pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
          pip install .[asr]
      
          deactivate
      

      Then run a transcribe.py script in that venv:

          import sys
          import nemo.collections.asr as nemo_asr

          model_path = sys.argv[1]
          audio_path = sys.argv[2]

          # Load from a local .nemo file...
          asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(restore_path=model_path)

          # ...or, instead, download from huggingface ('org/model'):
          # asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_path)

          output = asr_model.transcribe([audio_path])
          print(output[0])
      

      With that I was able to run the model, but I ran out of memory on my lower-spec laptop. I haven't yet got around to running it on my workstation.

      You'll need to modify the python script to process the response and output it in a format you can use.
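
      For example, something like this to dump the text next to the audio file (depending on the NeMo version, transcribe() returns plain strings or Hypothesis objects, so this handles both):

          from pathlib import Path

          def save_transcript(audio_path, output):
              first = output[0]
              # newer NeMo versions return Hypothesis objects with a .text attribute,
              # older ones return plain strings
              text = getattr(first, "text", first)
              out_file = Path(audio_path).with_suffix(".txt")
              out_file.write_text(str(text) + "\n", encoding="utf-8")
              return out_file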

Can you give an example of why it made your life that much better?

  • I used it like the sibling commenter to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better than YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.

    I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv and subtitles.

    • This, but I want a summary of the 3-hour video before spending the time on it.

      Download -> generate subtitles -> feed to AI for summary works pretty well
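
      A rough sketch of that last step, assuming you already have the SRT and an OpenAI-compatible endpoint (the model name and prompt are just placeholders, and a 3-hour transcript may need chunking to fit the context window):

          from pathlib import Path
          from openai import OpenAI  # any OpenAI-compatible client/endpoint works

          srt_text = Path("video.srt").read_text(encoding="utf-8")

          client = OpenAI()  # reads OPENAI_API_KEY from the environment
          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # placeholder model name
              messages=[
                  {"role": "system", "content": "Summarize these subtitles in ~10 bullet points."},
                  {"role": "user", "content": srt_text},
              ],
          )
          print(resp.choices[0].message.content)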

  • Aside from accessibility, as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching at 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay them.

    But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less of a strain when there are captions for it.

    Obviously one important factor in the convenience is how fast your computer is at transcription or translation. Personally I don't currently use the features in real time, although I'd like to if a great UX comes along in other software.

    There's also a great podcast app opportunity here I hope someone seizes.

  • As a hard-of-hearing person, I can now download any video from the internet (e.g. YouTube) and generate subtitles on the fly, without having to struggle to understand badly recorded or unintelligible speech.

    • I did this as recently as today, for that reason, using ffmpeg and whisper.cpp. But not on the fly. I ran it on a few videos to generate VTT files.
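
      In case it helps anyone, the rough shape of that pipeline as a script - the paths and model are placeholders, and the whisper.cpp binary name and flags vary between versions ("main" in older builds, "whisper-cli" in newer ones):

          import subprocess
          from pathlib import Path

          video = Path("talk.mp4")          # placeholder input file
          wav = video.with_suffix(".wav")

          # whisper.cpp expects 16 kHz mono PCM audio, so convert first
          subprocess.run(["ffmpeg", "-y", "-i", str(video),
                          "-ar", "16000", "-ac", "1", str(wav)], check=True)

          # -ovtt writes a .vtt next to the output basename given with -of
          subprocess.run(["./main", "-m", "models/ggml-medium.bin",
                          "-f", str(wav), "-ovtt", "-of", str(video.with_suffix(""))],
                         check=True)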

  • I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.

    10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or take older lecture videos that don't have transcripts. Many courses had to provide them to comply with federal funding, but not all, and lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German and Swiss institutions). Also think about taking this auto-generated output and then generating summaries for lecture notes or reading recommendations - this sort of stuff is what LLMs are great at.

    You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask about why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.

    Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.

whisper is great, i wonder why youtube's auto-generated subs are still so bad? even the smallest whisper is way better than google's solution. is it a licensing issue? harder to deploy at scale?

  • I believe youtube still uses 40 mel-scale features as input, while whisper uses 80 (which provides finer spectral detail but is naturally more computationally intensive to process; modern hardware allows for that).
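
    For anyone curious, here's roughly what that difference looks like with librosa (the file name is a placeholder, and Whisper's real front end lives in its own code; this just illustrates the n_mels knob):

        import librosa
        import numpy as np

        y, sr = librosa.load("clip.wav", sr=16000)

        # 80 mel bins (roughly Whisper's front end) vs. a coarser 40-bin one
        mel_80 = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
        mel_40 = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)

        log_mel_80 = np.log10(np.maximum(mel_80, 1e-10))
        print(log_mel_80.shape, mel_40.shape)  # (80, n_frames) vs (40, n_frames)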

  • You’d think they’d use the better model at least for videos with large view counts (they already do that when deciding compression optimizations).

Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers, I built a simple-to-use and affordable API that you can use: https://lemonfox.ai/

Kdenlive also supports auto-generating subtitles; they need some editing, but it's faster than creating them from scratch. Actually, I would be happy even with a simple voice detector so that I don't have to set the timings manually.
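
A crude version of that voice detector is doable with pydub's silence detection (the file name and thresholds are made up, and it finds non-silence rather than strictly speech, but it's enough to rough in timings):

    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent

    audio = AudioSegment.from_file("video.mp4")  # pydub needs ffmpeg for video containers

    # [start_ms, end_ms] spans of audio louder than the threshold
    spans = detect_nonsilent(audio, min_silence_len=500, silence_thresh=audio.dBFS - 16)

    for start_ms, end_ms in spans:
        print(f"{start_ms / 1000:.2f} --> {end_ms / 1000:.2f}")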

Subtitle Edit is great, and its subtitle library libse was exactly what I needed for a project I did.

You don't happen to know of a Whisper solution that combines diarization with live audio transcription, do you?