Comment by derf_
4 days ago
Also, while the author complains that there is not a lot of high quality data around [0], you do not need a lot of data to train small models. Depending on the problem you are trying to solve, you can do a lot with single-digit gigabytes of audio data. See, e.g., https://jmvalin.ca/demo/rnnoise/
[0] Which I do agree with, particularly if you need it to be higher quality or labeled in a particular way: the Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active speech determination. It is also fewer than 6000 conversations totaling less than 1000 hours (x2 speakers, but each is silent over half the time, a fact that can also throw a wrench in some standard algorithms, like volume normalization). It is also English-only.
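To make the volume normalization point concrete, here is a minimal sketch, assuming a synthetic channel (one second of digital silence followed by one second of a 200 Hz tone at 8 kHz) rather than real Fisher audio. The helper names `rms` and `active_rms`, the 160-sample frame, and the energy threshold are all illustrative choices, not anything from the corpus tooling. It compares a level measured over the whole channel with one measured only over frames that pass a crude energy gate.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def active_rms(samples, frame_len=160, threshold=1e-3):
    """RMS over only those frames whose own RMS exceeds `threshold`.

    This is a crude energy gate, assumed here for illustration; it is
    not a real voice activity detector.
    """
    active = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) > threshold:
            active.extend(frame)
    return rms(active)

# Synthetic stand-in for one speaker's channel: silent half the time.
rate = 8000
silence = [0.0] * rate
speech = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
channel = silence + speech

naive = rms(channel)         # level measured over silence and speech together
gated = active_rms(channel)  # level measured over the active frames only

# How far off a gain computed from the naive level would be, in dB.
gain_error_db = 20 * math.log10(gated / naive)
print(f"naive={naive:.4f} gated={gated:.4f} error={gain_error_db:.2f} dB")
```

With half the channel silent, the whole-channel level understates the active level by a factor of sqrt(2), about 3 dB, so a normalizer driven by it would apply too much gain. Real conversations have background noise rather than exact zeros, so a fixed threshold like this one is only a starting point.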
Wrt data, it's not like there is a shortage of transcribed audio in the form of sheet music, lyrics and subtitles.
If one asks ~~nice~~ expensive enough they can even get isolated multitracks or teleprompter feeds together with the audiovisual tracks. Heck, if they wanted they could set up dedicated transcription teams for the plethora of podcasts with the costs somewhere in the rounding error range. But you can't siphon that off of torrents and paying for training material goes against the core ethics of the big players.
Too bad you can't really scrape tiktok/instagram reels with subtitles... Oh no, oh no, oh no no no no