Comment by friendzis

6 days ago

Wrt data, it's not like there is a shortage of transcribed audio in the form of sheet music, lyrics, and subtitles.

If one asks ~~nice~~ expensive enough, they can even get isolated multitracks or teleprompter feeds together with the audiovisual tracks. Heck, if they wanted, they could set up dedicated transcription teams for the plethora of podcasts, at a cost somewhere in the rounding-error range. But you can't siphon that off of torrents, and paying for training material goes against the core ethics of the big players.

Too bad you can't really scrape TikTok/Instagram reels with subtitles... Oh no, oh no, oh no no no no