Comment by zahlman
8 days ago
I actually don't understand how Anubis is supposed to "make sure you're not a bot". It seems to be more of a rate limiter than anything else. Its self-description:
> Anubis sits in the background and weighs the risk of incoming requests. If it asks a client to complete a challenge, no user interaction is required.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums. Anubis has a customizable difficulty for this proof-of-work challenge, but defaults to 5 leading zeroes.
When I go to Codeberg or any other site using it, I'm never asked to perform any kind of in-browser task. It just has my browser run some JavaScript to do that calculation, or uses a signed JWT so the result of that process is cached.
Why shouldn't an automated agent be able to deal with that just as easily, by just feeding that JavaScript to its own interpreter?
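For what it's worth, the whole in-browser task seems to boil down to something like the sketch below (assuming the challenge amounts to "find a nonce whose SHA-256 digest starts with N zero hex digits"; that matches the quoted description, but the exact format Anubis uses may differ):

```python
import hashlib
import itertools

def solve_challenge(challenge: str, difficulty: int = 5) -> int:
    """Find a nonce such that SHA-256(challenge + nonce) starts with
    `difficulty` zero hex digits. The challenge layout here is made up;
    only the leading-zeroes idea comes from the quoted description."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

# Difficulty 4 keeps the demo quick; the default of 5 is roughly 16x more work.
print(solve_challenge("example-challenge-string", difficulty=4))
```

Any bot that can run a JavaScript interpreter, or reimplement a loop like that, pays the same cost a browser does.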
My understanding is that it just increases the "expense" of mass crawling enough to put it out of reach. If it costs fractional pennies per page to scrape with a plain Python or Go bot, it costs nickels and dimes to run a headless Chromium instance to do the same thing. The purpose is economic: make it too expensive to scrape the "open web". Whether it achieves that goal is another thing.
What do AI companies have more than everyone else? Compute.
Anubis directly incentivises the adversary, at the expense of everyone else.
It's what you would deploy if you wanted to exclude everyone else.
(Conspiracy theorists note that the author worked for an AI firm.)
"what do AI companies have more than everyone else? compute"
"Everyone else" actually has staggering piles of compute, utterly dwarfing the cloud, utterly dwarfing all the AI companies, dwarfing everything. It's also generally "free" on the margin. That is, if your web page takes 10 seconds to load due to an Anubis challenge, in principle you can work out what it is costing me but in practice it's below my noise floor of life expenses, pretty much rolled in to the cost of the device and my time. Whereas the AI companies will notice every increase of the Anubis challenge strength as coming straight out of their bottom line.
This is still a solid and functional approach. It was always going to be an arms race, not a magic solution, but this approach at least slants the arms race in the direction the general public can win.
(Perhaps tipping it in the direction of something CPUs can do but not GPUs would help. Something like a scrypt-based challenge instead of a SHA-256 challenge (https://en.wikipedia.org/wiki/Scrypt). Or some sort of problem where you need to explore a structure in parallel, but the branches have to cross-talk all the time and the RAM is comfortably more than a single GPU processing element can address. Also, I think that "just check once per session" is not going to make it, but there are ways you can make a user generate a couple of tokens before clicking the next link, so it looks like they only have to check once per page unless they are clicking very quickly.)
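As a minimal sketch of what a memory-hard variant could look like, here is a toy version using Python's built-in hashlib.scrypt; the parameters (n=2**14, r=8, roughly 16 MiB per attempt) and the leading-zero target are illustrative, not anything Anubis actually does:

```python
import hashlib
import itertools

def solve_memory_hard(challenge: bytes, difficulty: int = 1) -> int:
    """Find a nonce whose scrypt digest (salted with the challenge) starts
    with `difficulty` zero hex digits. Each attempt needs ~16 MiB of RAM
    (about 128 * r * n bytes), which is trivial for a browser but awkward
    to parallelise across thousands of GPU threads."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.scrypt(str(nonce).encode(), salt=challenge,
                                n=2**14, r=8, p=1, dklen=32).hex()
        if digest.startswith(target):
            return nonce

# A real deployment would tune difficulty/n/r against a page-load budget.
print(solve_memory_hard(b"example-challenge"))
```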
Anubis increases the minimum amount of compute required to request and crawl a page. How does that incentivize the adversary?
"Everyone else" (individually) isn't going to millions of webpages per day.
Please don't downvote comments just because you don't like their opinion (reply to them instead). It can't be that the same opinion is only valuable when someone famous writes it [1].
[1] https://news.ycombinator.com/item?id=44962529
You have it right. The problem Anubis is intended to solve isn't bots per se; the problem is that bot networks have figured out how to bypass rate limits by sending requests from newly minted, sometimes residential, IP addresses/ranges for each request. Anubis tries to help somewhat by making each (client, address) pair perform a proof-of-work. For normal users this should be an infrequent inconvenience, but those bot networks have to do it every time. And if they solve the challenge and keep the cookie, then the server "has them", so to speak, and can apply IP rate limits normally.
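To make the "server has them" part concrete, here's a toy sketch (not Anubis's actual code; names and limits are made up) of how requests might be counted against the (token, address) pair once the proof-of-work cookie exists:

```python
import time
from collections import defaultdict, deque

# Hypothetical sliding-window limiter: once a client has paid for its
# proof-of-work token, further requests are counted against (token, ip).
WINDOW_SECONDS = 60
MAX_REQUESTS = 120

_history = defaultdict(deque)   # (token, ip) -> timestamps of recent hits

def allow_request(pow_token: str, client_ip: str) -> bool:
    """Return True if this (token, ip) pair is still under its budget."""
    now = time.monotonic()
    hits = _history[(pow_token, client_ip)]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()          # drop hits that fell out of the window
    if len(hits) >= MAX_REQUESTS:
        return False            # over budget: demand a fresh proof of work
    hits.append(now)
    return True
```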
It's indeed not a "bot/crawler protection".
It's an "I don't want my server to be _overrun_ by crawlers" protection, which works by
- taking advantage of the fact that many crawlers are made very badly/cheaply
- increasing the cost of crawling
That's it: simple, but good enough to shake off the dumbest crawlers and to make it worth it for AI agents to e.g. cache site crawls, so that they don't crawl your site a thousand times a day but instead just once.
The AI crawlers have tens of thousands of IPs and some of them use something akin to a residential botnet.
If they notice that they are getting rate limited or IP blocked, they will use each IP only once. This means that IP-based rate limiting simply doesn't work.
The proof of work algorithm in Anubis creates an initial investment that is amortized over multiple requests. If you decide to throw the proof away, you will waste more energy, but if you don't, you can be identified and rate limited.
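A rough back-of-envelope, assuming the difficulty means leading zero hex digits, shows the amortization:

```python
# Expected SHA-256 attempts per challenge at difficulty d (leading zero hex
# digits) is about 16**d, regardless of who is solving it.
difficulty = 5
expected_hashes = 16 ** difficulty        # ~1.05 million attempts

for requests_per_token in (1, 100, 10_000):
    print(f"{requests_per_token:>6} requests/token -> "
          f"{expected_hashes / requests_per_token:,.0f} hashes per request")
# A crawler that throws the token away pays ~1M hashes for every single page;
# a client that keeps it spreads the same work over its whole session.
```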
The automated agent can never get around this, since running the code is playing by the rules. The goal of the automated agent is to ignore the rules.
I think the only requests it was able to block are plain HTTP requests made with curl or Go's stdlib HTTP client. I see enough of both in my httpd logs. Now the cancer has adapted by using fully featured headless web browsers that can complete challenges just like any other client.
As other commenters say, it was completely predictable from the start.
SHA-256 is an odd choice, given that ASICs built for Bitcoin mining are readily available and can compute these hashes crazy fast. An ASIC-resistant/memory-hard algorithm would be a better choice, possibly one of the Argon2 variants.
Another approach: require a hash(RESOURCE_ID, ITERATIONS, MEMORY_COST) for each and every resource request. Admittedly that might get a little tricky, considering that you don't want to bog down actual users with sluggish page loads. But if carefully tuned to the highest tolerable level it might actually be sufficient. (Maybe.) It's a hard problem....
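A minimal sketch of what that could look like (scrypt standing in for the memory-hard function; the parameter choices and RESOURCE_ID format are just illustrative):

```python
import hashlib

def resource_token(resource_id: str, iterations: int, memory_cost_mib: int) -> str:
    """Hypothetical per-resource work token: the client must run a memory-hard
    KDF whose parameters are derived from the resource itself before the
    server hands the resource over. scrypt's n is chosen so that
    128 * r * n bytes roughly matches the requested memory cost."""
    r = 8
    n = (memory_cost_mib * 1024 * 1024) // (128 * r)
    n = 1 << (n.bit_length() - 1)        # scrypt requires a power of two
    salt = f"{resource_id}:{iterations}:{memory_cost_mib}".encode()
    digest = resource_id.encode()
    for _ in range(iterations):          # chained calls supply the iteration cost
        digest = hashlib.scrypt(digest, salt=salt, n=n, r=r, p=1, dklen=32)
    return digest.hex()

# RESOURCE_ID here is an invented name/size/mtime concatenation.
print(resource_token("/blog/post-42.html|8192|2024-01-01", iterations=4,
                     memory_cost_mib=16))
```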
My dumb idea is to encrypt each HTML element in a chain, so that fully loading one HTML/JS element yields the key needed to load the next, and so on. Key retrieval can be throttled, split between client and server side, and embedded with each request to prevent the browser from loading everything at once.
This may tread too close to DRM, though, since it is essentially an element-protection scheme.
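A toy sketch of just the chaining part, using the third-party cryptography package (the fragment layout and key handling are made up for illustration, not a recommendation):

```python
from cryptography.fernet import Fernet

def build_chain(fragments: list[bytes]) -> tuple[bytes, list[bytes]]:
    """Encrypt fragments so each one carries the key for the next.
    Returns the first key plus the ciphertexts; the client must decrypt
    fragment i before it can even ask for fragment i+1's plaintext."""
    keys = [Fernet.generate_key() for _ in fragments]
    blobs = []
    for i, frag in enumerate(fragments):
        next_key = keys[i + 1] if i + 1 < len(keys) else b""
        blobs.append(Fernet(keys[i]).encrypt(next_key + b"\n" + frag))
    return keys[0], blobs

def read_chain(first_key: bytes, blobs: list[bytes]) -> list[bytes]:
    """Client side: the chain can only be walked strictly in order."""
    key, out = first_key, []
    for blob in blobs:
        next_key, frag = Fernet(key).decrypt(blob).split(b"\n", 1)
        out.append(frag)
        key = next_key or key
    return out

first_key, blobs = build_chain([b"<header>...</header>", b"<main>...</main>"])
print(read_chain(first_key, blobs))
```

The throttling would then live in whatever endpoint hands out the next blob, which is where it starts to look a lot like DRM.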
Since there is varying but requester-independent input into the hash function, doesn't this mean that the server has to calculate the entire value space too, and that these resource hashes can be reused across different requesters?
Binding a challenge-response to a specific resource doesn't sound like such a bad idea, though.
Well, exactly. The only variability would be on a per-resource basis, so the server-side calculations would likely be quite manageable. The RESOURCE_ID could be a simple concatenation of the name, size, and last-modification date of the resource, the ITERATIONS parameter would obviously be tuned by experimentation, and the MEMORY_COST set based on some sort of heuristic.
The real question is whether or not it would really be enough to discourage indiscriminate/unrestrained scraping. The disparity between the computing resources of your average user and a GPU-accelerated bot with tons of memory is, after all, so lop-sided that such an approach may not even be sufficient. For a user to compute a hash that requires 1024 iterations of an expensive function which demands 25 MB of memory might seem like a promising scraping deterrent at first glance. On the other hand, to a company which has numerous cores per processor running in separate threads and several terabytes of RAM at its disposal (multiplied by scores of computer racks), it might be just a drop in the bucket. In any case, it would definitely require a modicum of tuning/testing to see if it is even viable.
I have actually implemented this very kind of hash function in the past and can attest that the implementation is fairly trivial. With just a bit of number theory and some sponge-construction tricks, you can achieve a highly robust implementation in just a few dozen lines of JavaScript. Maybe when I have the time I should put something up on GitHub as a proof of concept for people to play with. =)
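In the meantime, a quick way to sanity-check the tuning question above is just to time one memory-hard call and see how many iterations fit in a tolerable page-load budget (scrypt stands in here for the custom function; the numbers are illustrative):

```python
import hashlib
import time

def time_one_call(memory_mib: int, r: int = 8) -> float:
    """Time a single scrypt call sized to roughly `memory_mib` MiB
    (scrypt needs about 128 * r * n bytes; n must be a power of two)."""
    n = 1 << (((memory_mib * 1024 * 1024) // (128 * r)).bit_length() - 1)
    start = time.perf_counter()
    hashlib.scrypt(b"probe", salt=b"probe", n=n, r=r, p=1,
                   maxmem=2 * memory_mib * 1024 * 1024, dklen=32)
    return time.perf_counter() - start

per_call = time_one_call(16)              # ~16 MiB per attempt
budget = 0.5                              # seconds of client time we'd tolerate
print(f"{per_call * 1000:.0f} ms per call; "
      f"~{int(budget / per_call)} iterations fit in {budget}s")
```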
As near as I can guess, the idea is that the code is optimized for what browsers can do and GPUs/servers/crawlers/etc. can't do as easily (or relatively as easily; just taking up the whole server for a bit might be a big cost). Indeed, it seems like only a matter of time before something like that gets broken.