Comment by johnklos
3 days ago
This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
Sure, the people who make the AI scraper bots are going to figure out how to actually do the work. The point is that they hadn't, and this worked for quite a while.
As the botmakers circumvent, new methods of proof-of-notbot will be made available.
It's really as simple as that. If a new method comes out and your site is safe for a month or two, great! That's better than dealing with fifty requests a second, wondering if you can block whole netblocks, and if so, which.
This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
> they are just feigning a lack of understanding to be dismissive of Anubis.
I actually find the featured article very interesting. It doesn't feel dismissive of Anubis, but rather it questions whether this particular solution makes sense or not in a constructive way.
I agree - the article is interesting and not dismissive.
I was talking more about some of the people here ;)
I still don't understand what Anubis solves if it can be bypassed this easily: if you use a user-agent switcher (I emulate wget) as a Firefox add-on on kernel.org or ffmpeg.org, you skip the entire check and walk straight past Anubis. Apparently they whitelist certain user agents to allow legitimate wget usage on these domains. But if I (an honest human) can do that, the scrapers and grifters can too.
https://addons.mozilla.org/en-US/firefox/addon/uaswitcher/
If anyone wants to try it themselves. This is by no means against Anubis, but it raises the question: can you even protect a domain if you force yourself to whitelist easy-to-guess UAs for a full bypass?
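For illustration, the spoof looks something like this (a minimal sketch; the URL is a placeholder, and it assumes the deployment really does whitelist Wget-style user agents as described above):

```python
# Sketch of the user-agent spoof described above. Assumes the site whitelists
# Wget-style UAs; requires the third-party "requests" package.
import requests

URL = "https://example.org/"  # placeholder for an Anubis-protected page

# A normal browser UA would get the Anubis interstitial first.
browser = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})

# A whitelisted tool's UA would skip the challenge entirely, if the
# deployment really does whitelist it.
spoofed = requests.get(URL, headers={"User-Agent": "Wget/1.21.4"})

print(browser.status_code, len(browser.text))
print(spoofed.status_code, len(spoofed.text))
```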
3 replies →
It really should be recognised just how many people are sitting through Cloudflare interstitials on nearly every site these days (and I totally get why this happens) yet making a huge amount of noise about Anubis on a very small number of sites.
I don't trip over CloudFlare except when in a weird VPN, and then it always gets out of my way after the challenge.
Anubis screws with me a lot, and often doesn't work.
The annoying thing about cloudflare is that most of the time once you’re blocked: you’re blocked.
There’s literally no way for you to bypass the block if you’re affected.
It's incredibly scary. I once had a bad user agent (without knowing it) and half the internet went offline; I couldn't even access documentation or my email provider's site, and there was no contact or debugging information to help me resolve it: just a big middle finger from half the internet.
I haven’t had issues with any sites using Anubis (yet), but I suspect there are ways to verify that you’re a human if your browser fails the automatic check at least.
15 replies →
I'm on an older system here, and both Cloudflare and Anubis entirely block me out of sites. Once you start blocking actual users out of your sites, it has simply gone too far. At least provide an alternative method to enter your site (e.g. via login) that's not hampered by erroneous human checks. Same for the captchas where you help train AIs by choosing from a set of tiny/noisy pictures. I often struggle for 5 to 10 minutes to get past that nonsense. I heard bots have less trouble.
Basically we're already past the point where the web is made for actual humans, now it's made for bots.
8 replies →
It's the other way around for me sometimes — I've never had issue with Anubis, I frequently get it with CF-protected sites.
(Not to mention all the sites which started putting country restrictions in on their generally useful instruction articles etc — argh)
I’m planning a trip to France right now, and it seems like half the websites in that country (for example, ratp.fr for Paris public transport info) require me to check a CloudFlare checkbox to promise that I am a human. And of those that don’t, quite a few just plain lock me out...
4 replies →
I get one basically every time I go to gitlab.com on Firefox.
It is easy to pass the challenge, but it isn't any better than Anubis.
Even when not on VPN, if a site uses the CloudFlare interstitials, I will get it every single time - at least the "prove you're not a bot" checkbox. I get the full CAPTCHA if I'm on a VPN or I change browsers. It is certainly enough to annoy me. More than Anubis, though I do think Anubis is also annoying, mainly because of being nearly worthless.
You must be on a good network. You should run one of those "get paid to share your internet connection with AI companies" apps. Since you're on a good network you might make a lot of money. And then your network will get cloudflared, of course.
We should repeat this until every network is cloudflared and everyone hates cloudflare and cloudflare loses all its customers and goes bankrupt. The internet would be better for it.
For me both are things that mostly show up for 1-3 seconds, then get replaced by the actual website. I suspect that's the user experience of 99% of people.
If you fall in the other 1% (e.g. due to using unusual browsers or specific IP ranges), cloudflare tends to be much worse
I hit Cloudflare's garbage about as much as I hit Anubis. With the difference that far more sites use Cloudflare than Anubis, thus Anubis is far worse at triggering false positives.
Huh? What false positives does Anubis produce?
The article doesn't say, and I constantly get the most difficult Google captchas, Cloudflare block pages saying "having trouble?" (which links to a ticket form that seems to land in /dev/null), IP blocks because of user agent spoofing, and "unsupported browser" errors when I don't spoof the user agent... the only anti-bot thing that reliably works on all my clients is Anubis. I'm really wondering what kinds of false positives you think Anubis has, since (as far as I can tell) it's a completely open and deterministic algorithm that just lets you in if you solve the challenge, and as the author of the article demonstrated with some C code (if you don't want to run the included JavaScript that does it for you), that works even if you are a bot. And afaik that's the point: no heuristics and no false positives, just a straight game of costs: making bad scraping behavior simply cost more than implementing caching correctly or using Common Crawl.
4 replies →
So yes, it is like having a stalker politely open the door for you as you walk into a shop, because they know very well who you are.
9 replies →
That says something about the chosen picture, doesn't it? Probably that it's not well liked. It certainly isn't neutral, while the Cloudflare page is.
You know, you say that, and while I understand where you're coming from, I was browsing the git repo when GitHub had a slight error and I was greeted with an angry pink unicorn. If GitHub can be fun like that, Anubis can too, I think.
6 replies →
Anubis was originally an open source project built for a personal blog. It gained traction, but the anime girl remained so that people are reminded of the nature of the project. Comparing it with Cloudflare is truly absurd. That said, a paid version is available with guard page customization.
Nothing says, "Change out the logo for something that doesn't make my clients tingle in an uncomfortable way" like the MIT license.
10 replies →
Both are equally terrible - one doesn't require explanations to my boss though
If your boss doesn't want you to browse the web, where some technical content is accompanied by an avatar that the author likes, they may not be suitable as a boss, or at least not for positions where it's their job to look over your shoulder and make sure you're not watching series during work time. It seems like a weird place to work if they need to check that anyway.
6 replies →
If Anubis didn't ship with a weird-looking anime girl, I think people would treat it akin to Cloudflare's block pages.
1 reply →
We can make noise about both things, and how they're ruining the internet.
Cloudflare's solution works without javascript enabled unless the website turns up the scare level to max or you are on an IP with already bad reputation. Anubis does not.
But at the end of the day both are shit and we should not accept either. That includes not using one as an excuse for the other.
Laughable. They say this, but anyone who actually surfs the web with a non-bleeding-edge, non-corporate browser gets constantly blocked by Cloudflare. The idea that their JS computational paywalls only pop up rarely is absurd. Anyone believing this line lacks lived experience. My Comcast IP shouldn't have a bad rep, and using a browser from ~2015 shouldn't make me scary. But I can't even read bills on congress.gov anymore thanks to bad CF deployments.
Also, Anubis does have a non-JS mode: the meta-refresh-based challenge in the HTML header. It's just that the type of people who use Cloudflare or Anubis almost always deploy the default (mostly broken) configs that block as many real humans as bots. And they never realize it, because they only measure such things with JavaScript.
Over the past few years I've read far more comments complaining about Cloudflare doing it than Anubis. In fact, this discussion section is the first time I've seen people talking about Anubis.
TO BE FAIR
I dislike those even more.
It sounds like you're saying that it's not the proof-of-work that's stopping AI scrapers, but the fact that Anubis imposes an unusual flow to load the site.
If that's true Anubis should just remove the proof-of-work part, so legitimate human visitors don't have to stare at a loading screen for several seconds while their device wastes electricity.
> If that's true Anubis should just remove the proof-of-work part
This is my very strong belief. To make it even clearer how absurd the present situation is, every single one of the proof-of-work systems I’ve looked at has been using SHA-256, which is basically the worst choice possible.
Proof-of-work is bad rate limiting which depends on a level playing field between real users and attackers. This is already a doomed endeavour. Using SHA-256 just makes it more obvious: there’s an asymmetry factor in the order of tens of thousands between common real-user hardware and software, and pretty easy attacker hardware and software. You cannot bridge such a divide. If you allow the attacker to augment it with a Bitcoin mining rig, the efficiency disparity factor can go up to tens of millions.
These proof-of-work systems are only working because attackers haven’t tried yet. And as long as attackers aren’t trying, you can settle for something much simpler and more transparent.
If they were serious about the proof-of-work being the defence, they’d at least have started with something like Argon2d.
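To make that asymmetry concrete, here's a rough hashcash-style sketch with a swappable work function. This is not Anubis's actual challenge format; the difficulty numbers are invented, and it assumes the third-party argon2-cffi package. The SHA-256 loop is the kind of thing GPUs and tuned native code run orders of magnitude faster than browser JavaScript; the Argon2d variant is memory-hard, which is what narrows that gap:

```python
import hashlib
import os
import time

from argon2.low_level import Type, hash_secret_raw


def solve(challenge: bytes, difficulty_bits: int, work) -> int:
    """Find a nonce whose digest has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = work(challenge + nonce.to_bytes(8, "big"))
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1


def sha256_work(data: bytes) -> bytes:
    # Cheap per call; GPUs and ASICs evaluate this billions of times per second.
    return hashlib.sha256(data).digest()


def argon2d_work(data: bytes) -> bytes:
    # Memory-hard: every evaluation touches ~64 MiB, which is what narrows the
    # gap between a phone browser and purpose-built attacker hardware.
    return hash_secret_raw(secret=data, salt=b"fixed-demo-salt!", time_cost=1,
                           memory_cost=64 * 1024, parallelism=1,
                           hash_len=32, type=Type.D)


if __name__ == "__main__":
    challenge = os.urandom(16)
    for name, work, bits in [("sha256", sha256_work, 16), ("argon2d", argon2d_work, 4)]:
        start = time.perf_counter()
        solve(challenge, bits, work)
        print(f"{name}: {time.perf_counter() - start:.2f}s at {bits} bits")
```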
The proof of work isn't really the crux. They've been pretty clear about this from the beginning.
I'll just quote from their blog post from January.
https://xeiaso.net/blog/2025/anubis/
Anubis also relies on modern web browser features:
- ES6 modules to load the client-side code and the proof-of-work challenge code.
- Web Workers to run the proof-of-work challenge in a separate thread to avoid blocking the UI thread.
- Fetch API to communicate with the Anubis server.
- Web Cryptography API to generate the proof-of-work challenge.
This ensures that browsers are decently modern in order to combat most known scrapers. It's not perfect, but it's a good start.
This will also lock out users who have JavaScript disabled, prevent your server from being indexed in search engines, require users to have HTTP cookies enabled, and require users to spend time solving the proof-of-work challenge.
This does mean that users using text-only browsers or older machines where they are unable to update their browser will be locked out of services protected by Anubis. This is a tradeoff that I am not happy about, but it is the world we live in now.
3 replies →
All this is true, but also somewhat irrelevant. In reality the amount of actual hash work is completely negligible.
For usability reasons Anubis only requires you to go through the proof-of-work flow once in a given period. (I think the default is once per week.) That's just very little work.
Detecting that you need to occasionally send a request through a headless browser is far more of a hassle than the PoW. If you prefer LLMs to normal internet search, they'll probably consume far more compute as well.
3 replies →
This is basically what most of the challenge types in go-away (https://git.gammaspectra.live/git/go-away/wiki/Challenges) do.
+1 for go-away. It's a bit more involved to configure, but worth the effort imo. It can be considerably more transparent to the user, triggering the nuclear PoW check less often, while being just as effective, in my experience.
I feel like the future will have this, plus ads displayed while the work is done, so websites can profit while you wait.
Every now and then I consider stepping away from the computer job, and becoming a lumberjack. This is one of those moments.
3 replies →
adCAPTCHA already does this:
https://adcaptcha.com
2 replies →
Exactly this.
I don't think anything will stop AI companies for long. They can do spot AI agentic checks of workflows that stop working for some reason and the AI can usually figure out what the problem is and then update the workflow to get around it.
This was obviously dumb when it launched:
1) scrapers just run a full browser and wait for the page to stabilize. They did this before this thing launched, so it probably never worked.
2) The AI reading the page needs something like 5 seconds * 1600W to process it. Assuming my phone can even perform that much compute as efficiently as a server class machine, it’d take a large multiple of five seconds to do it, and get stupid hot in the process.
Note that (2) holds even if the AI is doing something smart like batch processing 10-ish articles at once.
> This was obviously dumb when it launched:
Yes. Obviously dumb but also nearly 100% successful at the current point in time.
And likely going to stay successful as the non-protected internet still provides enough information to dumb crawlers that it’s not financially worth it to even vibe-code a workaround.
Or in other words: Anubis may be dumb, but the average crawler that's completely exhausting some sites' resources is even dumber.
And so it all works out.
And so the question remains: how dumb was it exactly, when it works so well and continues to work so well?
> Yes. Obviously dumb but also nearly 100% successful at the current point in time.
Only if you don't care about negatively affecting real users.
1 reply →
Does it work well? I run Chromium controlled by Playwright for scraping and typically have Gemini implement the script for it, because it's not worth my time otherwise. But I'm not crawling the Internet generally (which I think there is very little financial incentive to do; it's a very expensive process even ignoring Anubis et al); it's always that I want something specific and am sufficiently annoyed by the lack of an API.
Regarding the authentication mentioned elsewhere, passing cookies is no big deal.
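For the curious, the whole flow is roughly this (a minimal sketch; the URL is a placeholder, and it assumes Playwright plus a Chromium install):

```python
from playwright.sync_api import sync_playwright

URL = "https://example.org/"  # placeholder for an Anubis-fronted page

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()

    # The interstitial runs its own JavaScript and then redirects, so waiting
    # for the network to settle is all it takes to end up on the real page.
    page.goto(URL, wait_until="networkidle")
    html = page.content()

    # The pass cookie can then be reused by cheaper, non-browser requests.
    cookies = context.cookies()
    browser.close()

print(len(html), [c["name"] for c in cookies])
```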
1 reply →
Does it actually? I don't think I've seen a case study with hard numbers.
3 replies →
the workaround is literally just running a headless browser, and that's pretty much the default nowadays.
if you want to save some $$$ you can spend like 30 minutes making a cracker like in the article. just make it multi threaded, add a queue and boom, your scraper nodes can go back to their cheap configuration. or since these are AI orgs we're talking about, write a gpu cracker and laugh as it solves challenges far faster than any user could.
custom solutions aren't worth it for individual sites, but with how widespread anubis is it's become worth it.
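something like this, roughly (a generic hashcash-style puzzle with made-up difficulty, not anubis's exact wire format):

```python
import hashlib
from multiprocessing import Pool


def solve(job):
    """Worker: find a nonce giving `difficulty` leading zero hex digits."""
    challenge, difficulty = job
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return challenge, nonce
        nonce += 1


if __name__ == "__main__":
    # pretend these came off a queue fed by the scraper nodes
    jobs = [(f"challenge-{i}", 4) for i in range(32)]

    # CPU-bound hashing wants processes in Python (threads hit the GIL);
    # a GPU implementation would be orders of magnitude faster again.
    with Pool() as pool:
        for challenge, nonce in pool.imap_unordered(solve, jobs):
            print(challenge, "solved with nonce", nonce)
```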
I agree. Your estimate for (2), about 0.0022 kWh, corresponds to about a sixth of the charge of an iPhone 15 Pro and would take longer than ten minutes on the phone, even at max power draw. It feels about right for the amount of energy/compute of a large modern MoE loading large pages of several tens of thousands of tokens. For example, this tech (a couple of months old) could feed 52.3k tokens per second into a 672B-parameter model per H100 node instance, which probably burns about 6–8 kW while doing it. The new B200s should be about 2x to 3x more energy efficient, but your point still holds within an order of magnitude.
https://lmsys.org/blog/2025-05-05-large-scale-ep/
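A quick back-of-the-envelope check of those figures (the iPhone battery capacity is an approximation):

```python
# back-of-the-envelope for the figures above; the iPhone battery capacity
# (~12.7 Wh for an iPhone 15 Pro) is an approximation
seconds, watts = 5, 1600
kwh = seconds * watts / 3_600_000          # 8000 J, about 0.0022 kWh
iphone_battery_kwh = 12.7 / 1000           # about 0.0127 kWh
print(f"{kwh:.4f} kWh, {kwh / iphone_battery_kwh:.2f} of a full charge")
```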
The argument doesn't quite hold. The mass scraping (for training) is almost never done by a GPU system; it's almost always done by a dedicated system running a full Chrome fork in some automated way (not just the signatures but some bugs give that away).
And frankly, processing a single page of text fits within a single token window, so it likely runs for a blink (milliseconds) before moving on to the next data entry. The kicker is that it's run potentially thousands of times, depending on your training strategy.
At inference there's now a dedicated tool that may perform a "live" request to scrape the site contents. But then this is just pushed into a massive context window to give the next token anyway.
The point is that scraping is already inherently cost-intensive so a small additional cost from having to solve a challenge is not going to make a dent in the equation. It doesn't matter what server is doing what for that.
3 replies →
The problem is that 7 + 2 on a submission form only affects people who want to submit something, Anubis affects every user who wants to read something on your site
The question then is why read-only users are consuming so many resources that serving them big chunks of JS instead reduces the load on the server. Maybe improve your rendering and/or caching before employing DRM solutions that are doomed to fail anyway.
The problem it's originally fixing is bad scrapers accessing dynamic site content that's expensive to produce, like trying to crawl all diffs in a git repo, or all mediawiki oldids. Now it's also used on mostly static content because it is effective vs scrapers that otherwise ignore robots.txt.
The author makes it very clear that he understands the problem Anubis is attempting to solve. His issue is that the chosen approach doesn't solve that problem; it just inhibits access to humans, particularly those with limited access to compute resources.
That's the opposite of being dismissive. The author has taken the time to deeply understand both the problem and the proposed solution, and has taken the time to construct a well-researched and well-considered argument.
> This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
This is a confusing comment because it appears you don’t understand the well-written critique in the linked blog post.
> This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
The key point in the blog post is that it’s the inverse of a CAPTCHA: The proof of work requirement is solved by the computer automatically.
You don’t have to teach a computer how to solve this proof of work because it’s designed for the computer to solve the proof of work.
It makes the crawling process more expensive because it has to actually run scripts on the page (or hardcode a workaround for specific versions) but from a computational perspective that’s actually easier and far more deterministic than trying to have AI solve visual CAPTCHA challenges.
But for actual live users who don't see anything but a transient screen, Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).
The question is if this is the sweet spot, and I can't find anyone doing the comparative study (how many annoyed human visitors, how many humans stopped and, obviously, how many bots stopped).
> Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).
Most CAPTCHAs are invisible these days, and Anubis is worse than them. Also, CAPTCHAs are not normally deployed just for visiting a site, they are mostly used when you want to submit something.
4 replies →
No, it’s exactly because I understand that it bothers me. I understand it will be effective against bots for a few months and best, and legitimate human users will be stuck dealing with the damn thing for years to come. Just like captchas.
This arms race will have a terminus. The bots will eventually be indistinguishable from humans. Some already are.
> The bots will eventually be indistinguishable from humans
Not until they get issued government IDs they won't!
Extrapolating from current trends, some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade, and naturally, this will be included in the anti-bot arsenal. It will be up to the site operator to trust identities signed by the Russian government.
1. Despite what Sam Altman's eyeball company will try to sell you, government registers will always be the anchor of trust for proof-of-identity; they've been doing it for centuries, have become good at it, and have earned the goodwill.
How does this work, though?
We can't just have "send me a picture of your ID" because that is pointlessly easy to spoof - just copy someone else's ID.
So there must be some verification that you, the person at the keyboard, are the same person that ID identifies. The UK is rapidly finding out that that is extremely difficult to do reliably. Video doesn't really work reliably in all cases, and still images are too easily spoofed. It's not really surprising, though, because identifying humans reliably is hard even for humans.
We could do it at the network level, like assigning a government-issued network connection to a specific individual, so the system knows that any traffic from a given IP address belongs to that specific individual. But there are obvious problems with this model, not least that IP addresses were never designed for this, and spoofing an IP becomes identity theft.
We also do need bot access for things, so there must be some method of granting access to bots.
I think that to make this work, we'd need to re-architect the internet from the ground up. To get there, I don't think we can start from here.
25 replies →
Can't wait to sign into my web browser with my driver's license.
6 replies →
The internet would come to a grinding halt as everyone would suddenly become mindful of their browsing. It's not hard to imagine a situation where, say, pornhub sells its access data and the next day you get sacked at your teaching job.
32 replies →
Eyeball company play is to be a general identity provider, which is an obvious move for anyone who tries to fill this gap. You can already connect your passport in the World app.
https://world.org/blog/announcements/new-world-id-passport-c...
1 reply →
> some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade
I believe this is likely, and implemented in the right way, I think it will be a good thing.
A zero-knowledge way of attesting persistent pseudonymous identity would solve a lot of problems. If the government doesn’t know who you are attesting to, the service doesn’t know your real identity, services can’t correlate users, and a service always sees the same identity, then this is about as privacy-preserving as you can get with huge upside.
A social media site can ban an abusive user without them being able to simply register a new account. One person cannot operate tens of thousands of bot profiles. Crawlers can be banned once. Spammers can be locked out of email.
5 replies →
At this future point, AI firms will simply rent people’s identities to use online.
2 replies →
This has quite nasty consequences for privacy. For this reason, alternatives are desirable. I have less confidence on what such an alternative should be, however.
10 replies →
Fun fact: The Norwegian wine monopoly is rolling out exactly this to prevent scalpers buying up new releases. Each online release will require a signup in advance with a verified account.
Eh? With the "anonymous" models that we're pushing for right now, nothing stops you from handing over your verification token (or the control of your browser) to a robot for a fee. The token issued by the verifier just says "yep, that's an adult human", not "this is John Doe, living at 123 Main St, Somewhere, USA". If it's burned, you can get a new one.
If we move to a model where the token is permanently tied to your identity, there might be an incentive for you not to risk your token being added to a blocklist. But there's no shortage of people who need a bit of extra cash and for whom it's not a bad trade. So there will be a nearly-endless supply of "burner" tokens for use by trolls, scammers, evil crawlers, etc.
3 replies →
Can't wait to start my stolen id as a service for the botnets
Maybe there will be a way to certify humanness. Human testing facility could be a local office you walk over to get your “I am a human” hardware key. Maybe it expires after a week or so to ensure that you are still alive.
But if that hardware key is privacy-preserving (i.e. websites don't get your identity when you use it), what prevents you from using it for your illegal activity? Scrapers and spam are built by humans, who could get such a hardware key.
4 replies →
It will be hard to tune them to be just the right level of ignorant and slow as us though!
Soon enough there will be competing Unicode characters that can remove exclamation points.
It's been going on for decades now too. It's a cat and mouse game that will be with us for as long as people try to exploit online resources with bots. Which will be until the internet is divided into nation nets, suffocated by commercial interests, and we all decide to go play outside instead.
No. This went into overdrive in the "AI" (crawlers for massive LLM for ML chatbot) era.
Frankly it's something I'm sad we don't yet see a lawsuit for similar to the times v OpenAI. A lot of "new crawlers" claim to innocently forget about established standards like robots.txt
I just wish people would name and shame the massive companies at the top stomping on the rest of the internet in an edge to "get a step up over the competition".
That doesn't really challenge what I said; there's not much "different this time" except that the scale is commensurate with the era. Search engine crawlers used to take down websites as well.
I understand and agree with what you are saying though, the cat and mouse is not necessarily technical. Part of solving the searchbot issue was also social, with things like robots.txt being a social contract between companies and websites, not a technical one.
Yes, this is not a problem that will be solved with technical measures. Trying to do so is only going to make the web worse for us humans.
The cost benefit calculus for workarounds changes based on popularity. Your custom lock might be easy to break by a professional, but the handful of people who might ever care to pick it are unlikely to be trying that hard. A lock which lets you into 5% of houses however might be worth learning to break.
Sure.
It might be a tool in the box. But it’s still cat and mouse.
Where I work, we quickly concluded that the scrapers have tons of compute and the "proof-of-work" aspect was meaningless to them. It's simply the "response from site changed, need to change our scraping code" aspect that helps.
> The point is that they hadn't, and this worked for quite a while.
That's what I was hoping to get from the "Numbers" section.
I generally don't look up the logs or numbers on my tiny, personal web spaces hosted on my server, and I imagine I could, at some point, become the victim of aggressive crawling (or maybe I have without noticing because I've got an oversized server on a dual link connection).
But the numbers actually only show the performance of doing the PoW, not the effect it has had on any site — I am just curious, and I'd love it if someone has done the analysis, ideally grouped by the bot type ("OpenAI bot was responsible for 17% of all requests, this got reduced from 900k requests a day to 0 a day"...). Search, unfortunately, only gives me all the "Anubis is helping fight aggressive crawling" blog articles, nothing with substance (I haven't tried hard, I admit).
Edit: from further down the thread there's https://dukespace.lib.duke.edu/server/api/core/bitstreams/81... but no analysis of how many real customers were denied — more data would be even better
>But it takes a human some time and work to tell the crawler HOW.
Yes, for these human-based challenges. But this challenge is defined in code. It's not like crawlers don't run JavaScript. It's 2025, they all use headless browsers, not curl.
If you are going to rely on security through obscurity there are plenty of ways to do that that won't block actual humans because they dare use a non-mainstream browser. You can also do it without displaying cringeworthy art that is only there to get people to pay for the DRM solution you are peddling - that shit has no place in the open source ecosystem.
On the contrary: Making things look silly and unprofessional so that Big Serious Corporations With Money will pay thousands of dollars to whitelabel them is an OUTSTANDING solution for preserving software freedom while raising money for hardworking developers.
I'd rather not raise money for "hardworking" developers if their work is spreading DRM on the web. And it's not just "Big Serious Corporations" that don't want to see your furry art.
6 replies →
Sounds like a similar idea to what the "plus N-word license" is trying to accomplish
I deployed a proof-of-work-based auth system once where every single request required hashing a new nonce. Compare with Anubis, where only one request a week requires it. The math said that doing it that frequently, with variable Argon2 params the server could tune if it suspected bots, would be impactful enough to deter them.
Would I do that again? Probably not. These days I’d require a weekly mDL or equivalent credential presentation.
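For anyone curious what that looks like, here's a rough sketch of the general shape of such a scheme (not my exact system): the server issues a fresh nonce per request and scales the difficulty with a suspicion score. SHA-256 is used here for brevity, where the real thing used tunable Argon2 parameters.

```python
import hashlib
import hmac
import os
import secrets

SERVER_KEY = os.urandom(32)  # keeps issued challenges from being forged


def issue_challenge(suspicion: float) -> dict:
    """Issue a fresh nonce plus a difficulty that grows with suspicion (0..1)."""
    nonce = secrets.token_hex(16)
    difficulty = 3 + int(suspicion * 4)  # 3..7 leading zero hex digits
    tag = hmac.new(SERVER_KEY, f"{nonce}:{difficulty}".encode(),
                   hashlib.sha256).hexdigest()
    return {"nonce": nonce, "difficulty": difficulty, "tag": tag}


def verify(challenge: dict, solution: int) -> bool:
    """Check the tag (so the server stays stateless) and the client's work."""
    msg = f"{challenge['nonce']}:{challenge['difficulty']}".encode()
    expected = hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, challenge["tag"]):
        return False
    digest = hashlib.sha256(f"{challenge['nonce']}{solution}".encode()).hexdigest()
    return digest.startswith("0" * challenge["difficulty"])


def solve(challenge: dict) -> int:
    """Client side: brute force until the digest meets the difficulty."""
    prefix = "0" * challenge["difficulty"]
    n = 0
    while True:
        digest = hashlib.sha256(f"{challenge['nonce']}{n}".encode()).hexdigest()
        if digest.startswith(prefix):
            return n
        n += 1


if __name__ == "__main__":
    ch = issue_challenge(suspicion=0.2)  # low suspicion, cheap challenge
    print(verify(ch, solve(ch)))
```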
I have to disagree that an anti-bot measure that only works globally for a few weeks, until bots trivially bypass it, is effective. In an arms race against bots, the bots win. You have to outsmart them by challenging them to do something that only a human can do, or that is actually prohibitively expensive for bots to do at scale. Anubis doesn't pass that test. And now it's littered everywhere, defunct and useless.
With all the SPAs out there, if you want to crawl the entire Web, you need a headless browser running JavaScript. Which will pass Anubis for free.
> As the botmakers circumvent, new methods of proof-of-notbot will be made available.
Yes, but the fundamental problem is that the AI crawler does the same amount of work as a legitimate user, not more.
So if you design the work such that it takes five seconds on a five year old smartphone, it could inconvenience a large portion of your user base. But once that scheme is understood by the crawler, it will delay the start of their aggressive crawling by... well-under five seconds.
An open source javascript challenge as a crawler blocker may work until it gets large enough for crawlers to care, but then they just have an engineer subscribe to changes on GitHub and have new challenge algorithms implemented before the majority of the deployment base migrates.
Weren't there also weird behaviors reported by web admins across the world, like the crawlers used by LLM companies fetching evergreen data ad nauseam, or something along those lines? I thought the point of adding PoW rather than just blocking them was to convince them to at least do it right.
You don't even need to go there. If the damn thing didn't work the site admin wouldn't have added it and kept it.
Sure the program itself is jank in multiple ways but it solves the problem well enough.
Every time we need to deploy such mechanisms, you reward those that have already crawled the data and penalize newcomers and other honest crawlers.
For some sites Anubis might be fitting, but it should be mindfully deployed.
Many sufficiently technical people take to heart:
- Everything is pwned
- Security through obscurity is bad
Without taking to heart:
- What a threat model is
And settle on a kind of permanent contrarian nihilist doomerism.
Why eat greens? You'll die one day anyway.
On a side note, is the anime girl image customizable? I did a quick Google search and it seems that only the commercial version offers rebranding.
It's free software. The paid version includes an option to change it, and they ask politely that you don't change it otherwise.
As I understand it, this is proof of work, which is strictly not a cat-and-mouse situation.
It is, because you are dealing with crawlers that already have a nontrivial cost per page; adding something relatively trivial, still within the bounds regular users accept, won't change the motivations of bad actors at all.
What is the existing cost per page? As far as I know, an HTTP request and some string parsing are somewhat trivial, say 14 kB of bandwidth per page?
Technical people are prone to black-and-white thinking, which makes it hard to understand that making something more difficult will cause people to do it less even though it’s still possible.
I think the argument on offer is more, this juice isn't worth the squeeze. Each user is being slowed down and annoyed for something that bots will trivially bypass if they become aware of it.
If they become aware of it and actually think it’s worthwhile. Malicious bots work by scaling, and implementing special cases for every random web site doesn’t scale. And it’s likely they never even notice.
3 replies →
Did you read the article? OP doesn't care about bots figuring it out. It's about the compute needed to do the work.
It's quite an interesting piece, I feel like you projected something completely different onto it.
Your point is valid, but completely adjacent.
Respectfully, I think it's you missing the point here. None of this is to say you shouldn't use Anubis, but Tavis Ormandy is offering a computer science critique of how it purports to function. You don't have to care about computer science in this instance! But you can't dismiss it because it's computer science.
Consider:
An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.
A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.
A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.
And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.
But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
This is also how the Blu-Ray BD+ system worked.
The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.
A lot of these passive types of anti-abuse systems rely on the rather bold assumption that making a bot perform a computation is expensive, but isn't for me as an ordinary user.
According to whom or what data exactly?
AI operators are clearly well-funded operations, and the amount of electricity and CPU power is negligible. Software like Anubis and nearly all its identical predecessors grant you access after a single "proof". So you then have free rein to scrape the whole site.
The best physical analogy are those shopping cart things where you have to insert a quarter to unlock the cart, and you presumably get it back when you return the cart.
The group of people this doesn't affect are the well-funded, a quarter is a small price to pay for leaving your cart in the middle of the parking lot.
Those that suffer the most are the ones that can't find a quarter in the cupholder so you're stuck filling your arms with groceries.
Would you be richer if they didn't charge you a quarter? (For these anti-bot tools you're paying the electric company, not the site owner.). Maybe. But if you're Scrooge McDuck who is counting?
Right, that's the point of the article. If you can tune asymmetric costs on bots/scrapers, it doesn't matter: you can drive bot costs to infinity without doing so for users. But if everyone's on a level playing field, POW is problematic.
I like your example because the quarters for shopping cards are not universal everywhere. Some societies have either accepted shopping cart shrinkage as an acceptable cost of doing business or have found better ways to deter it.
Scrapers are orders of magnitude faster than humans at browsing websites. If the challenge takes 1 second but a human stays on the page for 3 minutes, then it's negligible. But if the challenge takes 1 second and the scraper does its job in 5 seconds, you already have a 20% slowdown.
4 replies →
For what it's worth, kernel.org seems to be running an old version of Anubis that predates the current challenge generation method. Previously it took information about the user request, hashed it, and then relied on that being idempotent to avoid having to store state. This didn't scale and was prone to issues like in the OP.
The modern version of Anubis as of PR https://github.com/TecharoHQ/anubis/pull/749 uses a different flow. Minting a challenge generates state including 64 bytes of random data. This random data is sent to the client and used on the server side in order to validate challenge solutions.
The core problem here is that kernel.org isn't upgrading their version of Anubis as it's released. I suspect this means they're also vulnerable to GHSA-jhjj-2g64-px7c.
OP is a real human user trying to make your DRM work with their system. That you consider this to be an "issue" that should be fixed says a lot.
Right, I get that. I'm just saying that over the long term, you're going to have to find asymmetric costs to apply to scrapers, or it's not going to work. I'm not criticizing any specific implementation detail of your current system. It's good to have a place to take it!
I think that's the valuable observation in this post. Tavis can tell me I'm wrong. :)
> But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
Based on my own experience fighting these AI scrapers, I feel that the way they are actually implemented means that, in practice, there is asymmetry between the work scrapers have to do and the work humans do.
The pattern these scrapers follow is that they are highly distributed. I'll see a given {ip, UA} pair make a request to /foo, immediately followed by _hundreds_ of requests from completely different {ip, UA} pairs to all the links from that page (i.e. /foo/a, /foo/b, /foo/c, etc.).
This is a big part of what makes these AI crawlers such a challenge for us admins. There isn't a whole lot we can do with regular rate-limiting techniques: the IPs are always changing and are no longer limited to corporate ASNs (I'm now seeing IPs belonging to consumer ISPs and even cell phone companies), and the user agents all look genuine. But when looking through the logs you can see the pattern: all these unrelated requests are actually working together to perform a BFS traversal of your site.
Given this pattern, I believe that's what makes the Anubis approach actually work well in practice. A given user will encounter the challenge once when accessing the site for the first time, then they'll be able to navigate through it without incurring any further cost. The AI scrapers, meanwhile, would need to solve the challenge for every single one of their "nodes" (or whatever they would call their {ip, UA} pairs). From a site reliability perspective, I don't even care whether the crawlers manage to solve the challenge or not. That it manages to slow them down enough to rate limit them as a network is enough.
To be clear: I don’t disagree with you that the cost incurred by regular human users is still high. But I don’t think it’s fair to say that this is not a situation in which the cost to the adversary is not asymmetrical. It wouldn’t be if the AI crawlers hadn’t converged towards an implementation that behaves as a DDOS botnet.
The (almost only?) distinguishing factor between genuine users and bots is the total volume of requests, but this can still be used for asymmetric costs. If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended. A key point is that those inequalities look different at the next level of detail. A very rough model might be:
botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond
humanPain = c_1 * max(elapsedTimePerRequest) + c_2 * avg(elapsedTimePerRequest)
The article points out that the botPain Anubis currently generates is unfortunately much too low to hit any realistic threshold. But if the cost model I've suggested above is in any way realistic, then useful improvements would include:
1. More frequent but less taxing computation demands (this assumes c_1 >> c_2)
2. Parallel computation (this improves the human experience with no effect for bots)
ETA: Concretely, regarding (1), I would tolerate 500ms lag on every page load (meaning forget about the 7-day cookie), and wouldn't notice 250ms.
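Playing with the model numerically (all numbers invented, and it simplifies by assuming today's bot reuses a single cookie for its whole crawl):

```python
def bot_pain(n_requests, cpu_seconds_per_request, dollars_per_cpu_second=1e-5):
    return n_requests * cpu_seconds_per_request * dollars_per_cpu_second


def human_pain(lags, c_1=1.0, c_2=0.5):
    return c_1 * max(lags) + c_2 * sum(lags) / len(lags)


# status quo: one heavy ~3 s solve, then a week of free page loads
# (simplified: the bot reuses one cookie, so it pays the 3 s once)
print(bot_pain(1, 3.0), human_pain([3.0] + [0.0] * 49))

# suggestion 1: a ~0.25 s challenge on every one of 1M bot requests,
# and on every one of the human's 50 weekly page loads
print(bot_pain(1_000_000, 0.25), human_pain([0.25] * 50))
```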
That's exactly what I'm saying isn't happening: the user pays some cost C per article, and the bot pays exactly the same cost C. Both obtain the same reward. That's not how Hashcash works.
5 replies →
> The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
No, that's missing the point. Anubis is effectively a DDoS protection system; all the talk about AI bots comes from the fact that the latest wave of DDoS attacks was initiated by AI scrapers, whether intentionally or not.
If these bots would clone git repos instead of unleashing the hordes of dumbest bots on Earth pretending to be thousands and thousands of users browsing through git blame web UI, there would be no need for Anubis.
I'm not moralizing, I'm talking about whether it can work. If it's your site, you don't need to justify putting anything in front of it.
5 replies →
> There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
That depends on what you count as normal users, though. Users who want to use alternative players also have to deal with this, and since yt-dlp (and youtube-dl before it) have been able to provide a solution for those users, and bots can just do the same, I'm not sure I'd call the scheme successful in any way.
[flagged]
Also, it forces the crawler to gain code execution capabilities, which for many companies will just make them give up and scrape someone else.
I don't know if you've noticed, but there's a few websites these days that use javascript as part of their display logic.
Yes, and those sites take way more effort to crawl than other sites. They may still get crawled, but likely less often than the ones that don't use JavaScript for rendering (which is the main purpose of Anubis - saving bandwidth from crawlers who crawl sites way too often).
(Also, note the difference between using JavaScript for display logic and requiring JavaScript to load any content at all. Most websites do the first, the second isn't quite as common.)
The fundamental failure of this is that you can’t publish data to the web and not publish data to the web. If you make things public, the public will use it.
It’s ineffective. (And furry sex-subculture propaganda pushed by its author, which is out of place in such software.)
The misguided parenthetical aside, this is not about resources being public, this is about bad actors accessing those resources in a highly inefficient and resource-intensive manner, effectively DDOS-ing the source.
>And furry sex-subculture propaganda pushed by its author
if your first thought when seeing a catgirl is sex, i got bad news for you