Comment by NobodyNada
3 days ago
My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
Earlier today I found we'd served over a million requests to over 500,000 different IPs.
All had the same user agent (current Safari), they seem to be from hacked computers as the ISPs are all over the world.
The structure of the requests almost certainly means we've been specifically targeted.
But it's also a valid query, reasonably for normal users to make.
From this article, it looks like Proof of Work isn't going to be the solution I'd hoped it would be.
The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.
Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). This still isn't all that much -- looking at cloud VM prices, that's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot but it's much better than "too low to even measure".
However, the article observes Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something like a second to be a more appropriate difficulty. Perhaps the server is benefiting from hardware accelerated sha256, whereas Anubis has to be fast enough on clients without it? If it's possible to bring the JavaScript PoW implementation closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.
I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.
(Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)
If the scraper scrapes from a small number of IPs they're easy to block or rate-limit. Rate-limits against this behaviour are fairly easy to implement, as are limits against non-human user agents, hence the botnet with browser user agents.
The Duke University Library analysis posted elsewhere in the discussion is promising.
I'm certain the botnets are using hacked/malwared computers, as the huge majority of requests come from ISPs and small hosting providers. It's probably more common for this to be malware, e.g. a program that streams pirate TV, or a 'free' VPN app, which joins the user's device to a botnet.