Comment by buro9
19 days ago
Their appetite cannot be sated, and there is little to no value in giving them access to the content.
I have data... 7d from a single platform with about 30 forums on this instance.
4.8M hits from Claude
390k from Amazon
261k from Data For SEO
148k from Chat GPT
That Claude one! Wowser.
Bots that match this (which is also the list I block on some other forums that are fully private by default):
(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*
I am moving to just blocking them all, it's ridiculous.
Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).
There's also a popular repository that maintains a comprehensive list of LLM- and AI-related bots to aid in blocking these abusive strip miners.
https://github.com/ai-robots-txt/ai.robots.txt
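For reference, the repo publishes ready-made blocklists (a robots.txt, among other formats); a minimal hand-rolled equivalent covering just a few of the bots named above would look something like:

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: CCBot
    User-agent: Bytespider
    Disallow: /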
I didn't know about this. Thank you!
After some digging, I also found a great way to surprise bots that don't respect robots.txt[1] :)
[1]: https://melkat.blog/p/unsafe-pricing
You know, at this point, I wonder if an allowlist would work better.
I love (hate) the idea of a site where you need to send a personal email to the webmaster to be whitelisted.
We just need a browser plugin to auto-email webmasters to request access, and wait for the follow-up "access granted" email. It could be powered by AI.
I have not heard the word "webmaster" in such a long time
I have thought about writing such a thing...
1. A proxy that looks at HTTP Headers and TLS cipher choices
2. An allowlist that records which browsers send which headers and select which ciphers
3. A dynamic loading of the allowlist into the proxy at some given interval
New browser versions or OS updates would need the allowlist updated, but I'm not sure that's all that inconvenient, and it could be done via GitHub so people could submit new combinations.
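A rough nginx-only sketch of that idea, assuming nginx 1.11.7+ terminating TLS (when the $ssl_ciphers variable appeared) and a generated include file as the allowlist; the file path and its contents are made up here, and reloading the allowlist on an interval is just a periodic `nginx -s reload`:

    # Maps "user agent | client-offered cipher list" combos to 1 (allowed).
    # /etc/nginx/browser-fingerprints.conf is a hypothetical generated file
    # holding entries for known browser/OS combinations.
    map "$http_user_agent|$ssl_ciphers" $allowed_client {
        default 0;
        include /etc/nginx/browser-fingerprints.conf;
    }

    server {
        listen 443 ssl;
        # ssl_certificate / ssl_certificate_key ...

        if ($allowed_client = 0) {
            return 403;  # or 429/503, per the status-code discussion further down
        }
    }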
I'd rather just say "I trust real browsers" and dump the rest.
Also, I noticed a far simpler block: just reject almost every request whose UA claims to be "compatible".
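In the same spirit as the nginx rule quoted further down the thread, that's roughly the following (noting that a few legitimate legacy user agents also contain the word):

    if ($http_user_agent ~* "compatible") {
        return 403;
    }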
Everything on this can be programmatically simulated by a bot with bad intentions. It will be a cat and mouse game of finding behaviors that differentiate between bot and not and patching them.
To truly say “I trust real browsers” requires a signal of integrity of the user and browser, such as cryptographic device attestation of the browser... which has to be centrally verified. Which is also not great.
This is Cloudflare with extra steps
If you mean user-agent-wise, I think real users vary too much to do that.
That could also be a user login, maybe, with per-user rate limits. I expect that bot runners could find a way to break that, but at least it's extra engineering effort on their part, and they may not bother until enough sites force the issue.
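If it did come down to logins, per-user rate limits are cheap to bolt on at the proxy; a minimal nginx sketch, keyed on a hypothetical session cookie (the cookie name is an assumption):

    # http context: 2 requests/second per session, with a small burst allowance.
    # Requests without the cookie have an empty key and are NOT limited by this
    # zone, so unauthenticated traffic needs separate handling (e.g. a login wall).
    limit_req_zone $cookie_sessionid zone=per_user:10m rate=2r/s;

    server {
        location / {
            limit_req zone=per_user burst=20 nodelay;
        }
    }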
I hope this is working out for you; the original article indicates that at least some of these crawlers move to innocuous user agent strings and change IPs if they get blocked or rate-limited.
This is a new twist on the Dead Internet Theory I hadn’t thought of.
We'll have two entirely separate (dead) internets! One for real hosts who will only get machine users, and one for real users who only get machine content!
Wait, that seems disturbingly conceivable with the way things are going right now. *shudder*
You're just plain blocking anyone using Node from programmatically accessing your content with Axios?
Apparently yes.
If a more specific UA hasn't been set, and the library doesn't force people to do so, then the library that has been the source of abusive behaviour is blocked.
No loss to me.
Why not?
>> there is little to no value in giving them access to the content
If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products? Especially given that people now often consult ChatGPT instead of searching on Google?
> If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products?
ChatGPT won't 'recommend' anything that wasn't already recommended in a Reddit post, or on an Amazon page with 5000 reviews.
You have, however, correctly spotted the market opportunity. Future versions of CGPT will offer the ability to "promote" your eshop in responses, in exchange for money.
Would you consider giving these crawlers access if they paid you?
Interesting idea, though I doubt they'd ever offer a reasonable amount for it. But doesn't it also change a site's legal stance if you're now selling your users' content/data? I think it would also drive a number of users away from your service.
At this point, no.
No, because the price they'd offer would be insultingly low. The only way to get a good price is to take them to court for prior IP theft (as NYT and others have done), and get lawyers involved to work out a licensing deal.
This is one of the few interesting uses of crypto transactions at reasonable scale in the real world.
What mechanism would make it possible to enforce non-paywalled, non-authenticated access to public web pages? This is a classic "problem of the commons" type of issue.
The AI companies are signing deals with large media and publishing companies to get access to data without the threat of legal action. But nobody is going to voluntarily make deals with millions of personal blogs, vintage car forums, local book clubs, etc. and set up a micropayment system.
Any attempt to force some kind of micropayment or "prove you are not a robot" system will add a lot of friction for actual users and will be easily circumvented. If you are LinkedIn and you can devote a large portion of your R&D budget to this, you can maybe get it to work. But if you're running a blog on stamp collecting, you probably will not.
Use the ex-hype to kill the new hype?
And the ex-hype would probably fail at that, too :-)
What does crypto add here that can't be accomplished with regular payments?
What do you use to block them?
Nginx; it's nothing special, it's just my load balancer.
if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {return 403;}
403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests a 500, 503, or 429 HTTP status code.
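If so, the rule quoted above presumably just needs a different status code, e.g.:

    if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {
        return 429;
    }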
From the article:
> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).
It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.
4.8M requests sounds huge, but if it's over 7 days and especially split amongst 30 websites, it's only a TPS of about 0.26 per site, not exactly very high or even abusive.
The fact that you choose to host 30 websites on the same instance is irrelevant; those AI bots scan websites, not servers.
This has been a recurring pattern I've seen in people complaining about AI bots crawling their website: a huge number of requests, but actually a low TPS once you dive a bit deeper.
It's never that smooth.
In fact, 2M requests arrived on December 23rd from Claude alone, for a single site.
An average of 25 qps is definitely an issue; these are all long-tail dynamic pages.
Curious what your robots.txt looked like, if you have a link?