
Comment by phyzix5761

8 hours ago

Why would they tell the LLM exactly how to download all their files in bulk for free? Isn't that the opposite of the self-preservation they're trying to do?

I think, obviously, they're trying to get the LLM to make a donation without explicit user approval but I think they're shooting themselves in the foot.

We recently saw a post on here about an Italian Pokemon website getting near 0 traffic after Google AI indexed and trained on their data. Sadly, I think this is going to happen to a lot of sites. Not sure how we can stop it. Any ideas?

It's telling LLMs how to download all their files in a way that has the least impact on their infrastructure, while telling them that any other way will be met with CAPTCHAs. In the short term, that seems beneficial. LLMs can be quite persistent in their bad crawling attempts.

What role Anna's Archive plays in the future is an interesting question, but I'm optimistic about it. And if Anna's Archive fails, but lots of OpenClaw instances are hosting the torrents or at least have a local copy of parts of the library, that's still a decent outcome.

They are trying to distribute information, not get traffic.

The hope is probably that the LLMs will download properly rather than DDoSing them.

Honestly, I think they are being a bit naive in assuming that the scrapers give a shit.

A few of the large AI companies might care enough to set up a custom solution for you, assuming that your dataset is sufficiently large. Most don't. HTTP is the common protocol and HTML the standard format; a torrent is just a needless hassle.

The other problem Anna's Archive has is that its legality is questionable, so having an official collaboration with them might be problematic. Better to just crawl the site and claim that you crawl the entire web, so you accidentally crawled Anna's Archive.

  • I wouldn't be surprised if all the large AI labs already had an FTP account for Anna's.

    At the very least, the Chinese ones definitely would, regardless of the legality. The western labs would keep it under wraps, but they probably do too.

    At their scale, the cost of scraping or getting it directly from Anna's sources is way higher than just donating $50k and getting easy, fast access.

> Why would they tell the LLM exactly how to download all their files in bulk for free? Isn't that the opposite of the self-preservation they're trying to do?

The goal of AA is to spread the data for free, not to gatekeep it. Donations are optional.