Comment by mrweasel

6 hours ago

Honestly, I think they are being a bit naive in assuming that the scrapers give a shit.

A few of the large AI companies might care enough to set up a custom solution for you, assuming that your dataset is sufficiently large. Most don't. HTTP is the common protocol and HTML the standard format; a torrent is just needless hassle.

The other problem Anna's Archive has is that its legality is questionable, so having an official collaboration with them might be problematic. Better to just crawl the site and claim that you crawl the entire web, so you accidentally crawled Anna's Archive.

I wouldn't be surprised if all the large AI labs already had an FTP account for Anna's.

At the very least, the Chinese ones definitely would, regardless of the legality; the western labs would keep it under wraps, but they probably do too.

At their scale, the cost of scraping or getting it directly from Anna's sources is way higher than just donating $50k and getting easy, fast access.