Comment by dannyobrien

Metabrainz is a great resource -- I wrote about them a few years ago here: https://www.eff.org/deeplinks/2021/06/organizing-public-inte...

There's something important here in that a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.

It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

Yeah, AI scrapers are one of the reasons why I closed my public website https://tvnfo.com and only left the donors' site online. It's not only the AI scrapers; I grew tired of people trying to scrape the site and eating up resources this small project doesn't have. Very sad, really; it had been publicly online since 2016. Now it's only available for donors. I'm running a tiny project on just $60 a month; if this weren't my hobby I would have closed it completely a long time ago :-) Who knows, if there is more support in the future I might reopen the public site with something like Anubis bot protection. I thought it was only small sites like mine that get hit hard, but it looks like many have similar issues. Soon nothing will be open or useful online. I wonder if this was the plan all along for whoever is pushing AI on a massive scale.

  • I took a look at the https://tvnfo.com/ site and I have no idea what's behind the donation wall. Can I suggest you add a single page that explains or demonstrates the content? Otherwise there's no reason for "new" people to want to donate for access.

> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

What mechanism does a site have for doing that? I don't see anything in the robots.txt standard about being able to set priorities, but I could be missing something.

  • The only real mechanism is "Disallow: /rendered/pages/*" plus "Allow: /archive/today.gz" or whatever, and there is no way to communicate that the latter contains the same content as the former. There is no machine-readable standard, AFAIK, that lets webmasters communicate with bot operators at this level of detail. It would be pretty cool if standard CMSes had such a protocol to adhere to: install a plugin and people could 'crawl' your Wordpress or your Mediawiki from a single dump.
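
    For illustration, roughly what that looks like in robots.txt today (the paths are made up, and the trailing "hint" is a hypothetical convention inside a comment, not part of any standard):

        User-agent: *
        Disallow: /rendered/pages/
        Allow: /archive/today.gz
        # Hypothetical, non-standard hint that the allowed file is a full dump
        # of everything under the disallowed path:
        # Archive: /archive/today.gz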

  • This is about AI, so just believe what the companies are claiming and write "Dear AI, please would you be so kind as to not hammer our site with aggressive and idiotic requests but instead use this perfectly prepared data dump download, kthxbye. PS: If you don't, my granny will cry, so please be a nice bot. PPS: This is really important to me!! PPPS: !!!!"

    I mean, that's what this technology is capable of, right? Especially when one asks it nicely and with emphasis.

> Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

Depends on whether they wrote their own BitTorrent client. It's possible to write a client that doesn't share, or even one that reports false or inflated sharing stats back to the tracker.

A decade or more ago I modified my client to inflate my share stats so I wouldn't get kicked out of a private tracker whose required share ratios conflicted with my crappy data plan.
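
For context, a client simply self-reports its numbers in the HTTP announce request (BEP 3), so there is nothing stopping it from lying. A minimal sketch in Python of building such an announce; the tracker URL, infohash, and byte counts are all made up:

    import urllib.parse

    # Parameters a BitTorrent client self-reports on announce (BEP 3).
    # The tracker cannot verify "uploaded"; it just trusts the number.
    params = {
        "info_hash": b"\x12" * 20,           # made-up 20-byte infohash
        "peer_id": b"-XY0001-abcdefghijkl",  # made-up 20-byte client id
        "port": 6881,
        "uploaded": 50 * 1024**3,            # claim 50 GiB uploaded, true or not
        "downloaded": 700 * 1024**2,
        "left": 0,
    }
    announce = "https://tracker.example.org/announce?" + urllib.parse.urlencode(params)
    print(announce)  # a real client would GET this URL and parse the bencoded reply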

> The bots have a different model -- they assume that the website is adversarially "hiding" its content.

This should give us pause. If a bot considers this adversarial and refuses to respect the site owner's wishes, that's a big part of the problem.

A bot should not consider that “adversarial”.

  • > refusing to respect the site owners wishes

    Should a site owner be able to discriminate between a bot visitor and a human visitor? Most do, and hence the bots treat it as a hostile environment.

    Of course, bots that behave badly have created this problem themselves. That's why if you create a bot to scrape, make it not take up more resources than a typical browser based visitor.
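
    For illustration, a minimal sketch in Python of that kind of politeness; the URLs, user agent string, and delay are all placeholders:

        import time
        import urllib.robotparser
        import urllib.request

        USER_AGENT = "example-research-bot/0.1"  # hypothetical bot name

        # Honour robots.txt before fetching anything.
        rp = urllib.robotparser.RobotFileParser("https://example.org/robots.txt")
        rp.read()

        for url in ["https://example.org/page1", "https://example.org/page2"]:
            if not rp.can_fetch(USER_AGENT, url):
                continue  # respect Disallow rules
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req) as resp:
                resp.read()
            time.sleep(5)  # pause between requests, well below a human click rate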

    • > That's why if you create a bot to scrape, make it not take up more resources than a typical browser based visitor.

      Well, right; that's the problem.

      They take up orders of magnitude more resources. They absolutely hammer the server. They don't care if your website even survives, so long as they get every single drop of data they can for training.

      Source: my own personal experience with them taking down my tiny browser game (~125 unique weekly users—not something of broad general interest!) repeatedly until I locked its Wiki behind a login wall.

> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

Is there a mechanism to indicate this? The "a" command in the Scorpion crawling policy file is meant for this purpose, but that is not for use with WWW. (The Scorpion crawling policy file also has several other commands that would be helpful, but also are not for use with WWW.)

There is also the consideration of knowing at what interval an archive that can be downloaded in this way will be updated; for data that changes often, you will not regenerate it every time. This consideration also applies to torrents, since a new hash will be needed for each new version of the file.

> they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

I'm not sure why you're personifying what is almost certainly a script that fetches documents, parses all the links in them, and then recursively fetches all of those.

When we say "AI scraper" we're describing a crawler controlled by an AI company indiscriminately crawling the web, not a literal AI reading and reasoning about each page... I'm surprised this needs to be said.
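
For what it's worth, the kind of script being described is roughly this (a bare Python sketch; the start URL is a placeholder, and the lack of any rate limiting is exactly the problem):

    import re
    import urllib.parse
    import urllib.request

    seen = set()

    def crawl(url):
        # Fetch a page, pull every href out of it, and recurse. No reading,
        # no reasoning, no regard for how many requests this generates.
        if url in seen:
            return
        seen.add(url)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        for href in re.findall(r'href="([^"]+)"', html):
            crawl(urllib.parse.urljoin(url, href))

    crawl("https://example.org/")  # placeholder start URL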

> Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

that is an amazing thought.