Comment by ralph84
11 hours ago
The value of YouTube for AI isn't making AI videos, it's that it's an incredibly rich source for humanity's current knowledge in one place. All of the tutorials, lectures, news reports, etc. are great for training models.
Is that actually a moat? Seems like all model providers managed to scrape the entire textual internet just fine. If video is the next big thing I don’t see why they won’t scrape that too.
Scraping text across the entire internet is orders of magnitude easier than scraping YouTube. Even ignoring the sheer volume of data (exabytes), you simply will get blocked at an IP and account level before you make a reasonable dent. Even if you controlled the entire IPv4 space I’m not sure you could scrape all of YouTube without getting every single address banned. IPv6 makes address bans harder, true, but then you’re still left with the problem of actually transferring and then storing that much data.
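To put a rough number on the transfer problem, here is a back-of-the-envelope sketch. The 10 Gbit/s link speed and the single-exabyte total are assumptions picked for illustration, not figures from anyone in this thread, and the link is assumed to stay fully saturated the whole time.

```python
def transfer_years(total_bytes: float, link_bits_per_sec: float) -> float:
    """Years needed to move total_bytes over a fully saturated link."""
    seconds = total_bytes * 8 / link_bits_per_sec
    return seconds / (365.25 * 24 * 3600)

EXABYTE = 10**18  # decimal exabyte, in bytes

# Assumed: one sustained 10 Gbit/s link, no throttling, no retries.
years = transfer_years(EXABYTE, 10e9)
print(f"{years:.1f} years per exabyte")
```

Under those assumptions a single exabyte takes roughly 25 years on one link, so you would need many links in parallel before bans even enter the picture.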
For now, you actually get pretty far with Tor. Just reset your connection when you hit an IP ban by sending SIGHUP to the Tor daemon.
I did that when I was retraining Stable Audio for fun and it really turned out to be trivial enough to pull off as a little evening side project.
IPv6 doesn't make it "harder," as they would typically ban whole /48 prefixes.
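A minimal sketch of why rotating addresses inside one allocation doesn't help against a prefix ban, using Python's standard `ipaddress` module. The `ban_prefix` helper is hypothetical, the addresses come from the 2001:db8::/32 documentation range, and this is not a claim about how YouTube actually implements bans.

```python
import ipaddress

def ban_prefix(addr: str, prefix_len: int = 48) -> ipaddress.IPv6Network:
    """Return the enclosing network of the given length for an IPv6 address."""
    # strict=False masks off the host bits instead of raising ValueError.
    return ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)

# One offending address gets the whole /48 banned.
banned = ban_prefix("2001:db8:1234:5678::1")

# A "fresh" address from the same allocation is still covered...
rotated = ipaddress.ip_address("2001:db8:1234:ffff::abcd")
# ...while an address from a different /48 is not.
elsewhere = ipaddress.ip_address("2001:db8:1235::1")

print(banned, rotated in banned, elsewhere in banned)
```

A /48 covers 2^80 addresses, so hopping between them costs the scraper nothing and also gains nothing once the whole prefix is blocked.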
And we're probably already starting to see that, given the semi-recent escalations in the game of cat and also cat between YouTube and the likes of youtube-dl.
Reminds me of Reddit cracking down on API access after realizing that their data was useful. But I'd expect YouTube both to be quicker on the draw, knowing about AI data collection, and to have more time to react because of the orders of magnitude greater bandwidth required to scrape video.
And reddit turned around and sold it all for a mess of pottage…
> Seems like all model providers managed to scrape the entire textual internet just fine
Google, though, has been doing it for literal decades. That could mean that they have something nobody else (except archive.org) has: a history of how the internet, and the knowledge on it, has evolved.