Comment by suyash
11 days ago
Nice to see someone addressing this annoying problem; I'm seeing firsthand how bot traffic is going up as crawlers gobble up data. However, instead of relying on Cloudflare, it would be better to have an open-source protocol that handles permission and payment for crawlers/scrapers.
If you don't want payment, there is:
https://anubis.techaro.lol/
Used by https://gcc.gnu.org/bugzilla/, for example. It is less annoying than CAPTCHA/Turnstile/whatever because the proof-of-work challenge runs automatically.
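For anyone wondering what "runs automatically" means in practice: it's a hashcash-style puzzle. Here's a rough Python sketch of the idea (not Anubis's actual code; the difficulty and hash choice are illustrative): the server issues a random challenge, the visitor's browser grinds nonces in the background until the hash has enough leading zero bits, and the server verifies the result with a single hash.

    import hashlib, os, itertools

    DIFFICULTY_BITS = 18  # illustrative; kept low so the demo finishes quickly

    def issue_challenge() -> str:
        # Server side: a random challenge tied to the visitor's request.
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        # Client side (Anubis does this in the browser, invisibly to the user):
        # grind nonces until the hash has DIFFICULTY_BITS leading zero bits.
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # Server side: checking the expensive client work costs one hash.
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    challenge = issue_challenge()
    print(verify(challenge, solve(challenge)))  # True

A human visitor pays a few hundred milliseconds of CPU once, which they never notice; a scraper issuing millions of requests pays it millions of times.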
Sadly they seem to be getting through this one lately. I had a scraper hitting me at 80 qps, punching straight through Anubis. I had to set up a global rate limit that browns out the functionality they were interested in[1] under excessive load.
[1] This form https://marginalia-search.com/site/news.ycombinator.com
Probably headless Chrome. I'm going to investigate.
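(For the curious, the brownout above is conceptually just a global token bucket sitting in front of the expensive endpoint. A minimal Python sketch of that shape, with made-up numbers and not the actual Marginalia code:)

    import time, threading

    class GlobalTokenBucket:
        # Global, not per-IP: once aggregate traffic blows the budget,
        # the expensive feature browns out for everyone until load drops.
        def __init__(self, rate_per_sec: float, burst: float):
            self.rate, self.capacity = rate_per_sec, burst
            self.tokens, self.updated = burst, time.monotonic()
            self.lock = threading.Lock()

        def allow(self) -> bool:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

    site_info_limiter = GlobalTokenBucket(rate_per_sec=10, burst=30)  # made-up budget

    def handle_site_info(domain: str) -> tuple[int, str]:
        if not site_info_limiter.allow():
            # Brownout: serve a cheap placeholder instead of the expensive
            # per-domain report the scraper keeps hammering.
            return 503, "Temporarily degraded under load, try again later."
        return 200, f"full site report for {domain}"  # stand-in for the real work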
See also (AFAIK most of these support JSless challenges out of the box): haproxy-protection, go-away, anticrawl
Anubis does too: https://anubis.techaro.lol/docs/admin/configuration/challeng...
The protocol that Cloudflare are proposing could be implemented by anyone. There will need to be ways for crawlers to register and pay.
CF is acting as the merchant of record, so they will be the ones billing. It's unclear what cut of the price they will take (if any) or whether they will fold it into their bundled services.
This should be expanded to allow for:
* micropayments and subscriptions
* integration with the browser UI/UX
* multiple currencies
* implementation of multiple payment systems, including national instant settlement systems like UPI, NPP, FedNow etc.
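To make the register-and-pay idea concrete, here is a sketch of what the crawler side of a 402-based exchange could look like. The header names, price format, and budget are assumptions for illustration, not Cloudflare's published spec (Python, with the requests library assumed installed):

    import requests  # third-party HTTP client, assumed installed

    CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"  # hypothetical registered crawler
    MAX_PRICE_USD = 0.005  # per-page budget; illustrative number

    def fetch(url: str) -> bytes | None:
        # Declare up front the most we're willing to pay for this page.
        # Header names here are illustrative, not a real spec.
        resp = requests.get(url, headers={
            "User-Agent": CRAWLER_UA,
            "Crawler-Max-Price": f"USD {MAX_PRICE_USD}",
        })
        if resp.status_code == 402:
            # Publisher wants more than we offered: read its asking price
            # from a response header and skip pages that are over budget.
            asking = resp.headers.get("Crawler-Price", "unknown")
            print(f"skipping {url}: publisher asks {asking}")
            return None
        resp.raise_for_status()
        # On a 200 the merchant of record bills our registered crawler account.
        return resp.content

Whoever plays the merchant-of-record role (Cloudflare in their proposal, but it could be an open clearing house) would reconcile those charges, which is exactly where the micropayment, multi-currency, and instant-settlement extensions above would plug in.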
Is this companies collecting data for model training, or is it agentic tools operating on behalf of users?
I think in the grand scheme of things barely anyone (as of now) uses agents to crawl sites, apart from maybe the occasional quick search or two. At least that's been my observation of my friends in non-technical fields who use LLMs.
From what the web logs show, it is in fact the same few AI companies' crawlers constantly crawling and recrawling the same URLs over and over, presumably to get even the slightest advantage over each other; they are definitely in a zero-sum mindset at the moment.
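That churn is easy to measure: a few lines of log crunching along these lines (assuming a combined-format access log at an illustrative path, and an illustrative list of bot user-agent tokens) will show the same bots fetching the same URLs over and over:

    import re
    from collections import Counter

    # Combined log format: ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
    LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
    AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Amazonbot", "Bytespider")  # illustrative list

    hits = Counter()
    with open("/var/log/nginx/access.log") as log:  # illustrative path
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            path, ua = m.groups()
            bot = next((b for b in AI_BOTS if b in ua), None)
            if bot:
                hits[(bot, path)] += 1

    # URLs fetched repeatedly by the same bot: the recrawl churn described above.
    for (bot, path), n in hits.most_common(20):
        if n > 1:
            print(f"{n:6d}  {bot:12s}  {path}")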
Whatever it is, I've seen this abuse of the commons hammer GitLab servers so hard they peg 64 high-wattage server cores 24/7. Installing mitigations cut their power bill in half.