Comment by taikahessu
2 days ago
We had our non-profit website's bandwidth drained and the site temporarily closed (!!) under our hosting deal because Amazonbot was aggressively crawling URLs like ?page=21454 ... etc.
Thankfully SiteGround restored our site without any repercussions, as it was not our fault. Added Amazonbot to robots.txt after that one.
Don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.
For the "good" bots which at least respect robots.txt you can use this list to get ahead of them before they pummel your site.
https://github.com/ai-robots-txt/ai.robots.txt
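For anyone who hasn't used it, the entries in that list are ordinary robots.txt directives. A short excerpt-style sketch (the full list covers many more crawlers; GPTBot, ClaudeBot, CCBot, and Amazonbot are the operators' documented user agents):

```
# Block known AI crawlers that respect robots.txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /
```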
There's no easy solution for bad bots which ignore robots.txt and spoof their UA though.
Such as OpenAI, who will apparently ignore robots.txt and change their user agent to evade blocks [1].
1: https://www.reddit.com/r/selfhosted/comments/1i154h7/openai_...
For those looking, this is the best I've found: https://blog.cloudflare.com/declaring-your-aindependence-blo...
This seemed to work for some time when it came out, but IME it no longer does.
Thanks, will look into that!
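One stopgap that doesn't depend on the UA string at all is forward-confirmed reverse DNS: the big "good" crawlers publish the hostname suffixes their IPs resolve to, so you can verify a claimed crawler by its IP instead of trusting its header. A minimal sketch in Python (the suffixes shown are just examples; each operator documents its own, and this obviously can't catch bots that never claim to be a known crawler):

```python
import socket

def is_verified_crawler(ip, suffixes=(".googlebot.com", ".google.com")):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the IP to a hostname.
    2. Check the hostname ends with one of the operator's published suffixes.
    3. Forward-resolve that hostname and confirm the original IP is listed.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except OSError:
        return False  # no PTR record -> cannot verify
    if not host.endswith(suffixes):
        return False  # wrong domain -> spoofed user agent
    try:
        forward_ips = {ai[4][0] for ai in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in forward_ips  # forward confirmation
```

Google, Bing, and Apple all document this verification method for their crawlers, which is why spoofing those UAs from a random cloud IP is detectable.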
It is too bad we don’t have a convention already for the internet:
User/crawler: I’d like site
Server: ok that’ll be $.02 for me to generate it and you’ll have to pay $.01 in bandwidth costs, plus whatever your provider charges you
User: What? Obviously as a human I don’t consume websites so fast that $.03 will matter to me, sure, add it to my cable bill.
Crawler: Oh no, I’m out of money, (business model collapse).
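Funnily enough, HTTP has reserved status code 402 Payment Required for roughly this idea since the 1990s; no actual payment scheme was ever standardized around it. A toy, stdlib-only sketch of the exchange above (the header name and prices are made up):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_PER_PAGE = 0.03  # hypothetical: $.02 generation + $.01 bandwidth

class MeteredHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # In a real scheme the client would present a payment token;
        # here we just check for a made-up header.
        if self.headers.get("X-Payment-Token"):
            body = b"here is your page"
            self.send_response(200)
        else:
            body = f"Payment of ${PRICE_PER_PAGE:.2f} required".encode()
            self.send_response(402)  # 402 Payment Required (reserved, unused)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

A human's $.03 gets absorbed into their bill; a crawler fetching millions of pages hits the "business model collapse" step.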
I think that's a terrible idea, especially with ISP monopolies that love gouging their customers. They have a demonstrated history of markups well beyond their costs.
And I hope you're pricing this highly. I don't know about you, but I would absolutely notice $.03 a site on my bill, just from my human browsing.
In fact, I feel like this strategy would further put the Internet in the hands of the aggregators, since theirs becomes the one site you know you can get information from. Long term, that cost becomes a rounding error for them as people are funneled to their AI, because a membership there is cheaper than paying per page for the rest of the web.
> We had our non-profit website drained out of bandwidth
There are a number of sites having issues with scrapers (AI and others) generating so much traffic that transit providers are informing them their fees will go up at the next contract renewal if the traffic is not reduced. It's just very hard for individual sites to do much about it, as most of the traffic stems from AWS, GCP, or Azure IP ranges.
It is a problem and the AI companies do not care.
I want better laws. The bot operator should have to pay you damages for taking down your site.
If acting like inconsiderate tools starts costing money, they may stop.