Comment by suyash
11 days ago
Nice to see someone addressing this annoying problem; I'm seeing firsthand how bot traffic is going up as crawlers gobble up data. However, instead of relying on Cloudflare, it would be better to have an open-source protocol that handles permission and payment for crawlers/scrapers.
If you don't want payment, there is:
https://anubis.techaro.lol/
Used by https://gcc.gnu.org/bugzilla/, for example. It is less annoying than CAPTCHA/Turnstile/whatever because the proof-of-work challenge runs automatically.
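For anyone wondering what "runs automatically" means in practice: it's a hashcash-style puzzle. Here's a rough Python sketch of the idea (not Anubis's actual code; the difficulty and hash choice are illustrative): the server issues a random challenge, the visitor's browser grinds nonces in the background until the hash has enough leading zero bits, and the server verifies the result with a single hash.

    import hashlib, os, itertools

    DIFFICULTY_BITS = 18  # illustrative; kept low so the demo finishes quickly

    def issue_challenge() -> str:
        # Server side: a random challenge tied to the visitor's request.
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        # Client side (Anubis does this in the browser, invisibly to the user):
        # grind nonces until the hash has DIFFICULTY_BITS leading zero bits.
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # Server side: checking the expensive client work costs one hash.
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    challenge = issue_challenge()
    print(verify(challenge, solve(challenge)))  # True

A human visitor pays a few hundred milliseconds of CPU once, which they never notice; a scraper issuing millions of requests pays it millions of times.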
Sadly they seem to be getting through this one lately. I had a scraper hitting me at 80 qps, punching straight through Anubis. I had to set up a global rate limit that browns out the functionality they were interested in[1] under excessive load.
[1] This form https://marginalia-search.com/site/news.ycombinator.com
Probably headless Chrome. I'm going to investigate.
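(For the curious, the brownout above is conceptually just a global token bucket sitting in front of the expensive endpoint. A minimal Python sketch of that shape, with made-up numbers and not the actual Marginalia code:)

    import time, threading

    class GlobalTokenBucket:
        # Global, not per-IP: once aggregate traffic blows the budget,
        # the expensive feature browns out for everyone until load drops.
        def __init__(self, rate_per_sec: float, burst: float):
            self.rate, self.capacity = rate_per_sec, burst
            self.tokens, self.updated = burst, time.monotonic()
            self.lock = threading.Lock()

        def allow(self) -> bool:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

    site_info_limiter = GlobalTokenBucket(rate_per_sec=10, burst=30)  # made-up budget

    def handle_site_info(domain: str) -> tuple[int, str]:
        if not site_info_limiter.allow():
            # Brownout: serve a cheap placeholder instead of the expensive
            # per-domain report the scraper keeps hammering.
            return 503, "Temporarily degraded under load, try again later."
        return 200, f"full site report for {domain}"  # stand-in for the real work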
See also (AFAIK most of these support JSless challenges out of the box): haproxy-protection, go-away, anticrawl
Anubis does too: https://anubis.techaro.lol/docs/admin/configuration/challeng...
The protocol that Cloudflare are proposing could be implemented by anyone. There will need to be ways for crawlers to register and pay.
CF is acting as the merchant of record, so they will be the ones billing. It's unclear what cut of the price they will take (if any) or whether they will fold it into their bundled services.
This should be expanded to allow for:
* micropayments and subscriptions
* integration with the browser UI/UX
* multiple currencies
* implementation of multiple payment systems, including national instant settlement systems like UPI, NPP, FedNow etc.
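To make the register-and-pay idea concrete, here is a sketch of what the crawler side of a 402-based exchange could look like. The header names, price format, and budget are assumptions for illustration, not Cloudflare's published spec (Python, with the requests library assumed installed):

    import requests  # third-party HTTP client, assumed installed

    CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"  # hypothetical registered crawler
    MAX_PRICE_USD = 0.005  # per-page budget; illustrative number

    def fetch(url: str) -> bytes | None:
        # Declare up front the most we're willing to pay for this page.
        # Header names here are illustrative, not a real spec.
        resp = requests.get(url, headers={
            "User-Agent": CRAWLER_UA,
            "Crawler-Max-Price": f"USD {MAX_PRICE_USD}",
        })
        if resp.status_code == 402:
            # Publisher wants more than we offered: read its asking price
            # from a response header and skip pages that are over budget.
            asking = resp.headers.get("Crawler-Price", "unknown")
            print(f"skipping {url}: publisher asks {asking}")
            return None
        resp.raise_for_status()
        # On a 200 the merchant of record bills our registered crawler account.
        return resp.content

Whoever plays the merchant-of-record role (Cloudflare in their proposal, but it could be an open clearing house) would reconcile those charges, which is exactly where the micropayment, multi-currency, and instant-settlement extensions above would plug in.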
Is this companies collecting data for model training, or is it agentic tools operating on behalf of users?
I think in the grand scheme of things barely anyone (as of now) uses agents to crawl sites, apart from maybe the occasional quick search or two. At least that's been my observation of my friends in non-technical fields who use LLMs.
From what the web logs show, it is in fact the same few AI companies' crawlers constantly crawling and recrawling the same URLs over and over, presumably to get even the slightest advantage over each other; they are definitely in a zero-sum mindset at the moment.
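That churn is easy to measure: a few lines of log crunching along these lines (assuming a combined-format access log at an illustrative path, and an illustrative list of bot user-agent tokens) will show the same bots fetching the same URLs over and over:

    import re
    from collections import Counter

    # Combined log format: ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
    LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
    AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Amazonbot", "Bytespider")  # illustrative list

    hits = Counter()
    with open("/var/log/nginx/access.log") as log:  # illustrative path
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            path, ua = m.groups()
            bot = next((b for b in AI_BOTS if b in ua), None)
            if bot:
                hits[(bot, path)] += 1

    # URLs fetched repeatedly by the same bot: the recrawl churn described above.
    for (bot, path), n in hits.most_common(20):
        if n > 1:
            print(f"{n:6d}  {bot:12s}  {path}")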
Whatever it is, I've seen this abuse of the commons hammer GitLab servers so hard they peg 64 high-wattage server cores 24/7. Installing mitigations cut their power bill in half.