Comment by embedding-shape

1 day ago

I wonder how many of the current text-to-speech ML models have large amounts of leaked or "stolen" data in their training sets. Almost none of the TTS releases say exactly where their training data comes from, for some reason. I also wonder if we'll see an explosion in SOTA TTS ~6 months from now.

4 comments


nmacias 18 hours ago

GOOG-411 was "competing" with a strong company (1-800-FREE411) by serving no ads in a category worth ~$3.5B at the time. It seemed inexplicable back then, but they did it to collect voice samples. For reasons like that, I expect that this category of training is baked, but I don't have current domain knowledge, fwiw.

hirako2000 1 day ago

It's already there, and it keeps moving.

Even have a nice UI on top.

https://voicebox.sh/

jubilanti 1 day ago

Not really. Mozilla Common Voice (the ImageNet of speech) is larger than this. Their English dataset has 3,814 hours and 1.6 million sentences from 100k speakers.

https://commonvoice.mozilla.org/en/languages

interludead 17 hours ago

Yep, the silence around provenance is probably the most suspicious part.