Comment by tempest_
2 days ago
As bad as Cloudflare is, there is a reason people use it.
If you run a site with content that LLMs want, or with expensive calls that need a lot of compute and can exhaust resources when overused, the attack is relentless. It can be a full-time job trying to stop people who are dedicated to scraping the shit out of your site.
Even CF doesn't really stop it any more. The agent-run browsers seem to bypass it with relative ease.
Granted, but there are open source alternatives that don’t have the same obsession with meaningless digital signatures. Turnstile is just a terrible product.
What are the open source options? Turnstile is a replacement for reCAPTCHA after Google moved it from a free product to a paid one.
The main advantage of Turnstile is that it benefits from CF's ubiquity to help judge legitimate vs. illegitimate requests.
I would love to know what other options are available in this space aside from Turnstile, reCAPTCHA and hCaptcha.
Anubis is the new hotness, specifically billing itself as an "AI firewall". If you've had an animé waifu check that you're human, you've already used it.
The vast majority of websites today can and should be static, which makes even aggressive LLM scraping a non-issue.
One of the things that a lot of LLM scrapers are fetching are git repositories. They could just use git clone to fetch everything at once. But instead, they fetch them commit by commit. That's about as static as you can get, and it is absolutely NOT a non-issue.
No... Basically all git servers have to generate the file contents, diffs etc. on-demand because they don't store static pages for every single possible combination of view parameters. Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental. You could pre-render everything statically, but that could take up gigabytes or more for any repo of non-trivial size.
That's a pretty niche issue, but fairly easy to solve.
Statically prebuild the most common commits (the last XX) and heavily rate-limit deeper ones.
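A rough sketch of what that split could look like, assuming a per-client token bucket for the deep commits. Everything named here (`PREBUILT_DEPTH`, `route_commit_request`, the bucket size and refill rate) is made up for illustration and not taken from any real git server; the depth cutoff stands in for the unspecified "last XX".

```python
import time

# Hypothetical cutoff standing in for "last XX": commits newer than this
# depth are assumed to be pre-rendered as static pages.
PREBUILT_DEPTH = 50


class TokenBucket:
    """Simple token bucket: `capacity` burst, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


_buckets = {}  # client_id -> TokenBucket (in-memory only, for the sketch)


def route_commit_request(client_id, commit_depth, now=None):
    """Return "static", "dynamic", or "rate_limited" for a commit view.

    commit_depth is how many commits back from HEAD the request is
    (0 = newest). Only deep, dynamically rendered views cost a token.
    """
    now = time.monotonic() if now is None else now
    if commit_depth < PREBUILT_DEPTH:
        return "static"
    bucket = _buckets.get(client_id)
    if bucket is None:
        # Assumed limits: burst of 5 deep views, then one every 10 seconds.
        bucket = _buckets[client_id] = TokenBucket(5, 0.1, now)
    return "dynamic" if bucket.allow(now) else "rate_limited"
```

Recent commits never touch the limiter, so normal browsing stays cheap. A crawler walking history commit by commit burns through its burst and then gets throttled. Keying on `client_id` is the weak point in practice, since scrapers that rotate IPs each get a fresh bucket.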