Comment by buro9
19 days ago
Their appetite cannot be sated, and there is little to no value in giving them access to the content.
I have data... 7d from a single platform with about 30 forums on this instance.
4.8M hits from Claude
390k from Amazon
261k from Data For SEO
148k from Chat GPT
That Claude one! Wowser.
Bots that match this (which is also the list I block on some other forums that are fully private by default):
(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*
I am moving to just blocking them all, it's ridiculous.
Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).
There's also a popular repository that maintains a comprehensive list of LLM- and AI-related bots to aid in blocking these abusive strip miners.
https://github.com/ai-robots-txt/ai.robots.txt
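For reference, the repo publishes ready-made blocklists (a robots.txt, among other formats); a minimal hand-rolled equivalent covering just a few of the bots named above would look something like:

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: CCBot
    User-agent: Bytespider
    Disallow: /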
I didn't know about this. Thank you!
After some digging, I also found a great way to surprise bots that don't respect robots.txt[1] :)
[1]: https://melkat.blog/p/unsafe-pricing
You know, at this point, I wonder if an allowlist would work better.
I love (hate) the idea of a site where you need to send a personal email to the webmaster to be whitelisted.
We just need a browser plugin to auto-email webmasters to request access, and wait for the follow-up "access granted" email. It could be powered by AI.
I have not heard the word "webmaster" in such a long time
I have thought about writing such a thing...
1. A proxy that looks at HTTP Headers and TLS cipher choices
2. An allowlist that records which browsers send which headers and select which ciphers
3. A dynamic loading of the allowlist into the proxy at some given interval
New browser versions or OS updates would need the allowlist updated, but I'm not sure that's all that inconvenient, and it could be done via GitHub so people could submit new combinations.
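A rough nginx-only sketch of that idea, assuming nginx 1.11.7+ terminating TLS (when the $ssl_ciphers variable appeared) and a generated include file as the allowlist; the file path and its contents are made up here, and reloading the allowlist on an interval is just a periodic `nginx -s reload`:

    # Maps "user agent | client-offered cipher list" combos to 1 (allowed).
    # /etc/nginx/browser-fingerprints.conf is a hypothetical generated file
    # holding entries for known browser/OS combinations.
    map "$http_user_agent|$ssl_ciphers" $allowed_client {
        default 0;
        include /etc/nginx/browser-fingerprints.conf;
    }

    server {
        listen 443 ssl;
        # ssl_certificate / ssl_certificate_key ...

        if ($allowed_client = 0) {
            return 403;  # or 429/503, per the status-code discussion further down
        }
    }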
I'd rather just say "I trust real browsers" and dump the rest.
Also, I noticed a far simpler block: just reject almost every request whose UA claims to be "compatible".
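In the same spirit as the nginx rule quoted further down the thread, that's roughly the following (noting that a few legitimate legacy user agents also contain the word):

    if ($http_user_agent ~* "compatible") {
        return 403;
    }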
Everything on this can be programmatically simulated by a bot with bad intentions. It will be a cat and mouse game of finding behaviors that differentiate between bot and not and patching them.
To truly say “I trust real browsers” requires a signal of integrity of the user and browser, such as cryptographic device attestation of the browser... which has to be centrally verified. Which is also not great.
This is Cloudflare with extra steps
If you mean user-agent-wise, I think real users vary too much to do that.
That could also be a user login, maybe, with per-user rate limits. I expect that bot runners could find a way to break that, but at least it's extra engineering effort on their part, and they may not bother until enough sites force the issue.
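If it did come down to logins, per-user rate limits are cheap to bolt on at the proxy; a minimal nginx sketch, keyed on a hypothetical session cookie (the cookie name is an assumption):

    # http context: 2 requests/second per session, with a small burst allowance.
    # Requests without the cookie have an empty key and are NOT limited by this
    # zone, so unauthenticated traffic needs separate handling (e.g. a login wall).
    limit_req_zone $cookie_sessionid zone=per_user:10m rate=2r/s;

    server {
        location / {
            limit_req zone=per_user burst=20 nodelay;
        }
    }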
I hope this is working out for you; the original article indicates that at least some of these crawlers move to innocuous user agent strings and change IPs if they get blocked or rate-limited.
This is a new twist on the Dead Internet Theory I hadn’t thought of.
We'll have two entirely separate (dead) internets! One for real hosts who will only get machine users, and one for real users who only get machine content!
Wait, that seems disturbingly conceivable with the way things are going right now. *shudder*
You're just plain blocking anyone using Node from programmatically accessing your content with Axios?
Apparently yes.
If a more specific UA hasn't been set, and the library doesn't force people to do so, then the library that has been the source of abusive behaviour is blocked.
No loss to me.
Why not?
>> there is little to no value in giving them access to the content
If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products? Especially given that people now often consult ChatGPT instead of searching on Google?
> If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products?
ChatGPT won't 'recommend' anything that wasn't already recommended in a Reddit post, or on an Amazon page with 5000 reviews.
You have, however, correctly spotted the market opportunity. Future versions of CGPT will offer the ability to "promote" your eshop in responses, in exchange for money.
Would you consider giving these crawlers access if they paid you?
Interesting idea, though I doubt they'd ever offer a reasonable amount for it. But doesn't it also change a site's legal stance if you're now selling your users' content/data? I think it would also drive a number of users away from your service.
At this point, no.
No, because the price they'd offer would be insultingly low. The only way to get a good price is to take them to court for prior IP theft (as NYT and others have done), and get lawyers involved to work out a licensing deal.
This is one of the few interesting uses of crypto transactions at reasonable scale in the real world.
What mechanism would make it possible to enforce non-paywalled, non-authenticated access to public web pages? This is a classic "problem of the commons" type of issue.
The AI companies are signing deals with large media and publishing companies to get access to data without the threat of legal action. But nobody is going to voluntarily make deals with millions of personal blogs, vintage car forums, local book clubs, etc. and set up a micropayment system.
Any attempt to force some kind of micropayment or "prove you are not a robot" system will add a lot of friction for actual users and will be easily circumvented. If you are LinkedIn and you can devote a large portion of your R&D budget to this, you can maybe get it to work. But if you're running a blog on stamp collecting, you probably will not.
Use the ex-hype to kill the new hype?
And the ex-hype would probably fail at that, too :-)
What does crypto add here that can't be accomplished with regular payments?
What do you use to block them?
Nginx; it's nothing special, it's just my load balancer.
if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {return 403;}
403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests a 500, 503, or 429 HTTP status code.
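If so, the rule quoted above presumably just needs a different status code, e.g.:

    if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {
        return 429;
    }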
From the article:
> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).
It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.
4.8M requests sounds huge, but if it's over 7 days and especially split amongst 30 websites, it's only a TPS of about 0.26 per site, not exactly very high or even abusive.
The fact that you choose to host 30 websites on the same instance is irrelevant; those AI bots scan websites, not servers.
This has been a recurring pattern I've seen in people complaining about AI bots crawling their website: a huge number of requests, but actually a low TPS once you dive a bit deeper.
It's never that smooth.
In fact, 2M requests arrived on December 23rd from Claude alone, for a single site.
An average of 25 qps is definitely an issue; these are all long-tail dynamic pages.
Curious what your robots.txt looked like, if you have a link?