
Comment by freakynit

17 hours ago

Just tested this on a bunch of videos. It's crazy accurate (not 100%, obviously), fast, and resource-efficient.

I was testing not only the STT part but also timestamps and speaker identification. Tested on 3 different videos, both local and online.

Timestamps were accurate to within 500 ms, even on longer 20+ minute videos. Speaker identification worked equally well. My old M1 Air didn't hang at all while the transcription was running.

---

1. Here's the output from a single-speaker video (https://www.youtube.com/watch?v=-X6YzlY_8tM): https://pastebin.com/vPVPNnne

2. A shorter one with up to 4 different speakers and mixed, complex scene/narration changes (https://www.youtube.com/watch?v=4tASl0auPOg): https://pastebin.com/iHZZD8Qe

---

My zsh shorthand: `alias transcribe="yapsnap --timestamps --diarize "`