Comment by pedrozieg

2 days ago

What I like about this approach is that it quietly reframes the problem from “detect AI” to “make abusive access patterns uneconomical”. A simple JS+cookie gate is basically saying: if you want to hammer my instance, you now have to spin up a headless browser and execute JS at scale. That’s cheap for humans, expensive for generic crawlers that are tuned for raw HTTP throughput.
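To make that concrete, here's a minimal sketch of such a gate as a stdlib-only WSGI middleware; the cookie name, signing scheme, and 403 response are illustrative choices, not taken from any particular project:

```python
# Rough sketch of a JS+cookie gate as WSGI middleware (names and token
# scheme are illustrative, not a hardened implementation).
import hashlib, hmac, time

SECRET = b"change-me"          # assumption: a secret used to sign the cookie value
CHALLENGE_PAGE = """<!doctype html>
<script>
  document.cookie = "gate=%s; path=/; max-age=3600";
  location.reload();           // retry the original request, now with the cookie set
</script>"""

def make_token(day: str) -> str:
    return hmac.new(SECRET, day.encode(), hashlib.sha256).hexdigest()

class JsCookieGate:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        day = time.strftime("%Y-%m-%d")
        cookies = environ.get("HTTP_COOKIE", "")
        if f"gate={make_token(day)}" in cookies:
            return self.app(environ, start_response)   # token present: pass through
        # Otherwise serve the tiny JS challenge; clients that never run JS stop here.
        body = (CHALLENGE_PAGE % make_token(day)).encode()
        start_response("403 Forbidden", [("Content-Type", "text/html"),
                                         ("Content-Length", str(len(body)))])
        return [body]
```

Anything that speaks raw HTTP without executing the script never gets past the challenge, while a real browser (headless or not) clears it in one extra round trip.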

The deeper issue is that git forges are pathological for naive crawlers: every commit/file combo is a unique URL, so one medium repo explodes into Wikipedia-scale surface area if you just follow links blindly. A more robust pattern for small instances is to explicitly rate limit the expensive paths (/raw, per-commit views, “download as zip”), and treat “AI” as an implementation detail. Good bots that behave like polite users will still work; the ones that try to BFS your entire history at line rate hit a wall long before they can take your box down.
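As a sketch of what "rate limit the expensive paths" might look like in-process, here's a simple token bucket keyed per client IP; the path prefixes, rate, and burst are made-up examples:

```python
# Sketch of rate limiting only the expensive forge paths, keyed per client IP.
# Thresholds and path prefixes are examples, not recommendations.
import time
from collections import defaultdict

EXPENSIVE_PREFIXES = ("/raw/", "/commit/", "/archive/")   # raw files, per-commit views, zips
RATE = 1.0      # tokens refilled per second
BURST = 30.0    # bucket capacity

_buckets: dict[str, tuple[float, float]] = defaultdict(lambda: (BURST, time.monotonic()))

def allow(ip: str, path: str) -> bool:
    """Return False when an expensive request should be rejected (e.g. with a 429)."""
    if not path.startswith(EXPENSIVE_PREFIXES):
        return True                       # cheap pages are never throttled
    tokens, last = _buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        _buckets[ip] = (tokens, now)
        return False
    _buckets[ip] = (tokens - 1.0, now)
    return True
```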

Yeah, this is where I landed a while ago. What problem am I _really_ trying to solve?

For some people it's an ideological one--we don't want AI vacuuming up all of our content. For those, "is this an AI user?" is a useful question to answer. However, it's a hard one.

For many the problem is simply "there's a class of users that are putting way too much load on the system and it's causing problems". Initially I was playing whack-a-mole with this and dealing with alerts firing on a regular basis because of Meta crawling our site very aggressively, not backing off when errors were returned, etc.

I looked at rate limiting, but the work involved in distributed rate limiting versus the small number of offenders made the effort look a little silly, so I moved towards a "nuke it from orbit" strategy:

Requests are bucketed by /24 ("class C") subnet (31.13.80.36 -> 31.13.80.x) and request rate is tracked over 30 minute windows. If the request rate over that window exceeds a very generous threshold (I've only seen a few very obvious and poorly behaved crawlers exceed it), it fires an alert.
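Roughly, the bucketing and windowing might look like this; the threshold below is invented, the real one just needs to be generous enough that only abusive crawlers ever hit it:

```python
# Sketch of the /24 bucketing + 30-minute window counter described above.
import time
from collections import defaultdict, deque

WINDOW = 30 * 60          # seconds
THRESHOLD = 50_000        # requests per /24 per window (illustrative number)

_hits: dict[str, deque] = defaultdict(deque)

def bucket(ip: str) -> str:
    """31.13.80.36 -> '31.13.80' (the /24, i.e. the old 'class C')."""
    return ip.rsplit(".", 1)[0]

def record(ip: str, now: float | None = None) -> bool:
    """Record one request; return True when the subnet crosses the threshold."""
    now = time.time() if now is None else now
    q = _hits[bucket(ip)]
    q.append(now)
    while q and q[0] < now - WINDOW:      # drop hits that fell out of the window
        q.popleft()
    return len(q) > THRESHOLD
```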

The alert kicks off a flow where we look up the ASN covering every IP in that range, look up every range associated with those ASNs, and throw an alert in Slack with a big red "Block" button attached. When approved, the entire ASN is blocked at the edge.
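A sketch of the expansion step, assuming an ASN <-> prefix mapping has already been loaded (for instance from the database linked below); the example table is a tiny, partial stand-in:

```python
# Sketch of the "nuke it from orbit" expansion: given one offending IP, find the
# ASN(s) whose announced ranges cover it, then emit every range those ASNs announce.
import ipaddress

# Example data shape only -- a real table would come from an ASN <-> prefix database.
asn_to_prefixes: dict[int, list[str]] = {
    32934: ["31.13.24.0/21", "31.13.64.0/18", "157.240.0.0/16"],   # Facebook/Meta (partial)
}

def covering_asns(ip: str) -> list[int]:
    addr = ipaddress.ip_address(ip)
    return [asn for asn, prefixes in asn_to_prefixes.items()
            if any(addr in ipaddress.ip_network(p) for p in prefixes)]

def blocklist_for(ip: str) -> list[str]:
    """Every prefix announced by every ASN that covers this IP."""
    ranges: list[str] = []
    for asn in covering_asns(ip):
        ranges.extend(asn_to_prefixes[asn])
    return ranges

# blocklist_for("31.13.80.36") -> all three Meta prefixes above, ready to hand to
# whatever does the blocking at the edge (once someone presses the "Block" button).
```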

It's never triggered on anything we weren't willing to block (e.g., a local consumer ISP). We've dropped a handful of foreign providers, some "budget" VPS providers, some more reputable cloud providers, and Facebook. It didn't take long before the alerts stopped--both for high request rates and our application monitoring seeing excessive loads.

If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip

  • > If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip

    What exactly is the source of these mappings? Never heard about ipverse before, seems to be a semi-anonymous GitHub organization and their website has had a failing certificate for more than a year by now.

  • You ban the ASN permanently in this scenario?

    • So far, yes.

      I could justify it a number of ways, but the honest answer is "expiring these is extra work that just hasn't been needed yet". We hit a handful of bad actors, banned them, have seen no negative outcomes, and there's really little indication of the behaviour changing. Unless something shows up and changes the equation, right now it looks like "extra effort to invite the bad actors back to do bad things" and... my day is already busy enough.

It depends what your goal is.

Having to use a browser to crawl your site will slow down naive crawlers at scale.

But it wouldn't do much against individuals typing "what is a kumquat" into their local LLM tool that issues 20 requests to answer the question. They're not really going to care or notice if the tool has to use a Playwright instance instead of curl.

Yet it's that use-case that is responsible for ~all of my AI bot traffic according to Cloudflare, which is 30x the traffic of direct human users. In my case, being a forum, it made more sense to just block the traffic.

  • Maybe a stupid question but how can Cloudflare detect what portion of traffic is coming from LLM agents? Do agents identify themselves when they make requests? Are you just assuming that all playwright traffic originated from an agent?

    • That is what Cloudflare's bot metrics dashboard told me before I enabled their "Super Bot Fight Mode" system, which brought traffic back down to its pre-bot levels.

      I assume most traffic comes from hosted LLM chats (e.g. chatgpt.com) where the provider (e.g. OpenAI) is making the requests from their own servers.

I'm curious about whether there are well-coded AI scrapers that have logic for "aha, this is a git forge, git clone it instead of scraping, and git fetch on a rescrape". Why are there apparently so many naive (but still coded to be massively parallel and botnet-like, which is not naive in that aspect) crawlers out there?
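For reference, that forge-aware branch isn't much code; a rough sketch (the probe uses git's standard smart-HTTP discovery endpoint, everything else is made up):

```python
# Sketch of "this looks like a git repo, clone it instead of crawling it page by page".
import subprocess
import urllib.request
from pathlib import Path

def is_git_repo(url: str) -> bool:
    """A git smart-HTTP remote answers on <url>/info/refs?service=git-upload-pack."""
    probe = url.rstrip("/") + "/info/refs?service=git-upload-pack"
    try:
        with urllib.request.urlopen(probe, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def mirror(url: str, dest: Path) -> None:
    """Clone once, then only fetch the delta on later passes."""
    if (dest / "HEAD").exists():                          # bare mirror already present
        subprocess.run(["git", "-C", str(dest), "fetch", "--prune"], check=True)
    else:
        subprocess.run(["git", "clone", "--mirror", url, str(dest)], check=True)
```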

  • I'm not an industry insider and not the source of this fact, but it's been previously stated that the traffic costs to fetch the current data for each training run are cheaper than caching it in any way locally - whether it's a git repo, static sites or any other content available through http

    • This seems nuts and suggests maybe the people selling AI scrapers their bandwidth could get away with charging rather more than they do :)

  • I'd see this as coming down to incentives. If you can scrape naively and it's cheap, what's the benefit to you in doing something more efficient for a git forge? How many other edge cases are there where you could potentially save a little compute/bandwidth, but need to implement a whole other set of logic?

    Unfortunately, this kind of scraping seems to inconvenience the host way more than the scraper.

    Another tangent: there probably are better behaved scrapers, we just don't notice them as much.

  • If they're handling it as “website, don't care” (because they're training on everything online) they won't know.

    If they're treating it specifically as a "code forge" (because they're after coding use cases), there's lots of interesting information that you won't get by just cloning a repo.

    It's not just the current state of the repo, or all commits (and their messages). It's the initial issue (and discussion) that leads to a pull request (and review comments) that eventually gets squashed into a single commit.

    The way you code with an agent is a lot more similar to the issue, comments, change, review, refinement sequence that you get by slurping the website.

  • True, and it doesn't get mentioned enough. These supposedly world-changing advanced tech companies sure look sloppy as hell from here. There is no need for any of this scraping.

what's next: you can only read my content after mining btc and wiring it to $wallet->address