Comment by genewitch
1 day ago
ripgrep, except against the full 160GB dataset, mongoDB was faster on my ryzen.
I have a lot of subtitles. I'm partially hard of hearing and partially i can't stand the way everything is mastered, so i use volume normalization (sometimes called "night mode", vizio calls it this) and subtitles to make up for the fact that the audio tracks in most things are bad.
Well a side effect of subtitles is now i have context for every video that i can search. grep was grep.
you didn't think i'd leave you hanging https://i.imgur.com/Vs5AAT7.png
some other non-statistics from that day: 15GB sorted password list, newline delimited, UTF-8, on a spinning-rust drive: 64 seconds (~234MB/s) to make a copy of the file. ag and rg took 3.2 seconds to search the copy. I'm actually hesitant to state that grep took 52 seconds...
Thanks for replying, thanks for making me remember the great conversations we had around those topics a couple months ago, and thanks for creating ripgrep, it's my go-to for anything non-trivial!
Love it! That's awesome. Thank you for replying. :-)
I've occasionally wanted to put the subtitles from all of my Simpsons episodes into an easily searchable format. What do you use to extract subtitles?
I should note it sounds like i am a pirate, but when i rip a DVD i use handbrake to "Burn In" the subtitles straight to the video - since i originally cared about this because of my hearing. This "hey having all these as .vtt/.srt/.ass means i can search my recollection of this media!" came much (much, 20 years) later.
I burn in the subtitles because "streaming media players" nearly universally are awful at handling anything except 100% perfect subtitles - and that's if they bother handling them at all, over the last 20 years. Also my best friend was dating a deaf person, so the impetus for burning in for streaming was because we'd watch movies together in my living room on a rear projection TV via wifi streaming from a WHS "plex-like" server in my room. The device was a western digital something TV.
Oh i've never extracted, i use openai-whisper for long content and whisper-diarization for shorter content (<8 minutes or so). As i suspected, ffmpeg claims to handle it: https://trac.ffmpeg.org/wiki/ExtractSubtitles with a note that it should probably be used with `-c copy` to ensure a 1:1 copy of the subtitles.
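For anyone who does want to go the extraction route, here is a rough Python sketch of the kind of ffmpeg call that wiki page describes. The function names, the `.srt` default, and the choice of the first subtitle stream are assumptions made for illustration, not something from this thread, and `-c copy` only works when the output format can hold the source subtitle codec.

```python
import subprocess
from pathlib import Path


def build_extract_command(video, stream_index=0, out_ext=".srt", copy=False):
    """Build an ffmpeg argument list that pulls one subtitle stream from `video`.

    Hypothetical helper: the output file sits next to the input and only the
    extension changes.
    """
    video = Path(video)
    out_path = video.with_suffix(out_ext)
    # -map 0:s:N selects the N-th subtitle stream of the first input file.
    cmd = ["ffmpeg", "-i", str(video), "-map", f"0:s:{stream_index}"]
    if copy:
        # Stream copy keeps the subtitles as-is instead of converting them;
        # ffmpeg will fail if the target format cannot hold that codec.
        cmd += ["-c", "copy"]
    cmd.append(str(out_path))
    return cmd


def extract_subtitles(video, **kwargs):
    """Run ffmpeg (must be on PATH); raises CalledProcessError on failure."""
    cmd = build_extract_command(video, **kwargs)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling `extract_subtitles("Movie.mkv", copy=True)` would write `Movie.srt` next to the video, assuming the file actually has a text-based subtitle stream at index 0. Image-based subtitles (DVD/Blu-ray bitmaps) will not convert to `.srt` this way.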
also when i get stuff from a website with yt-dlp for archival i use
```pwsh
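# Prompt for the URL, then echo it back so it shows up in the console output.
$userInput = Read-Host -Prompt '480 video download script enter URL'
Write-Output "URL:`t`t$userInput"
# Video capped at 480p plus best audio, with uploaded and auto-generated subtitles.
yt-dlp.exe `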
-f 'bestvideo[height<=480]+bestaudio/best[height<=480]' `
--write-auto-subs --write-subs `
--fragment-retries infinite `
$userInput
```