Comment by nucleardog

2 months ago

Yeah, this is where I landed a while ago. What problem am I _really_ trying to solve?

For some people it's an ideological one--we don't want AI vacuuming up all of our content. For those, "is this an AI user?" is a useful question to answer. However, it's a hard one.

For many, the problem is simply "there is a class of users putting way too much load on the system and it's causing problems". Initially I was playing whack-a-mole with this and dealing with alerts firing on a regular basis because Meta was crawling our site very aggressively, not backing off when errors were returned, etc.

I looked at rate limiting but the work involved in distributed rate limiting versus the number of offenders involved made the effort look a little silly, so I moved towards a "nuke it from orbit" strategy:

Requests are bucketed by class C subnet, i.e. a /24 (31.13.80.36 -> 31.13.80.x), and request rate is tracked over 30 minute windows. If the request rate over that window exceeds a very generous threshold, one I've only seen a few very obvious and poorly behaved crawlers exceed, it fires an alert.
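
A minimal sketch of the bucketing and windowing described above, for anyone who wants something concrete to start from. It is a single-process, in-memory illustration and not the setup from this comment: the `SubnetRateTracker` class, the `bucket_for` helper, and the `THRESHOLD` value are all hypothetical (no actual threshold is given here), and a real deployment would more likely derive these counts from access logs or metrics.

```python
import ipaddress
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60  # 30 minute window, as described above
THRESHOLD = 50_000        # hypothetical value; pick something "very generous"


def bucket_for(ip):
    """Map an IPv4 address to its /24, e.g. 31.13.80.36 -> 31.13.80.0/24."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


class SubnetRateTracker:
    """Counts requests per /24 over a sliding time window (in memory only)."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.hits = defaultdict(deque)  # bucket -> request timestamps

    def record(self, ip, now):
        """Record one request at time `now` (seconds).

        Returns True if the bucket now exceeds the threshold.
        """
        q = self.hits[bucket_for(ip)]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q) > self.threshold
```

Calling `record(ip, time.time())` once per request and alerting when it returns True would approximate the behaviour described; storing one timestamp per request is simple but memory-hungry, so fixed-size counters per bucket may be a better fit at high volume.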

The alert kicks off a flow where we look up the ASN covering every IP in that range, look up every range associated with those ASNs, and throw an alert in Slack with a big red "Block" button attached. When approved, the entire ASN is blocked at the edge.
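
The lookup step could be sketched roughly as follows. This assumes the ASN <-> prefix data has already been loaded into a plain dict; the `ASN_PREFIXES` table below is made-up sample data (documentation ranges and private-use ASNs), and `asns_covering` and `ranges_to_block` are hypothetical helper names, not part of any library or of the flow described above. The Slack approval and the edge block itself are left out.

```python
import ipaddress

# Hypothetical sample data: ASN -> announced IPv4 prefixes.
# A real table would be loaded from an ASN/prefix database.
ASN_PREFIXES = {
    64500: ["192.0.2.0/25", "198.51.100.0/24"],
    64501: ["192.0.2.128/25"],
    64502: ["203.0.113.0/24"],
}


def asns_covering(bucket, table):
    """Return every ASN with a prefix overlapping the offending /24."""
    net = ipaddress.ip_network(bucket)
    return {
        asn
        for asn, prefixes in table.items()
        if any(net.overlaps(ipaddress.ip_network(p)) for p in prefixes)
    }


def ranges_to_block(bucket, table):
    """Collect all prefixes of the matching ASNs, merged where adjacent."""
    nets = [
        ipaddress.ip_network(p)
        for asn in asns_covering(bucket, table)
        for p in table[asn]
    ]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# Candidate block list to attach to the approval message.
print(ranges_to_block("192.0.2.0/24", ASN_PREFIXES))
```

The linear scan over every prefix is fine for a one-off lookup when an alert fires; a prefix tree would be the usual choice if this had to run per request.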

It's never triggered on anything we weren't willing to block (e.g., a local consumer ISP). We've dropped a handful of foreign providers, some "budget" VPS providers, some more reputable cloud providers, and Facebook. It didn't take long before the alerts stopped--both for high request rates and our application monitoring seeing excessive loads.

If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip

6 comments

embedding-shape 2 months ago

> If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip

What exactly is the source of these mappings? Never heard about ipverse before, seems to be a semi-anonymous GitHub organization and their website has had a failing certificate for more than a year by now.

cmrx64 2 months ago

whois (delegation files) according to the embedded blog post, eg https://ftp.arin.net/pub/stats/arin/delegated-arin-extended-...

sgc 2 months ago

You ban the ASN permanently in this scenario?

nucleardog 2 months ago

So far, yes.
I could justify it a number of ways, but the honest answer is "expiring these is more work that just hasn't been needed yet". We hit a handful of bad actors, banned them, have heard no negative outcomes, and there's really little indication of the behaviour changing. Unless something shows up and changes the equation, right now it looks like "extra effort to invite the bad actors back to do bad things" and... my day is already busy enough.

doctorpangloss 2 months ago

i don't know. use PAT. the long term solution is web environment integrity by another name.

tjpnz 2 months ago

And by a company which isn't knee deep in this itself.