Comment by eqvinox

3 days ago

TFA — and most comments here — seem to completely miss what I thought was the main point of Anubis: it counters the crawler's "identity scattering"/sybil'ing/parallel crawling.

Any access will fall into one of the following categories:

- client with JS and cookies. In this case the server now has an identity (from the cookie) to apply rate limiting to. Humans should never hit that limit, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated — at the cost of solving the puzzle again.

- amnesiac (no cookies) clients with JS. Each access is now expensive.

(- no JS: no access.)

The point is to prevent parallel crawling from overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs something to start and needs to stay below some rate limit. Previously, the server would collapse under thousands of crawler requests per second. That is what Anubis makes prohibitively expensive.
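To make that concrete, here is a minimal sketch of the server-side idea, assuming a sliding-window limiter keyed on the solved-challenge cookie. The names, thresholds, and structure are hypothetical illustrations, not Anubis's actual implementation:

```python
# Minimal sketch (not Anubis's actual code): once a client has solved the
# proof-of-work challenge, its cookie becomes an identity the server can
# rate-limit; minting a fresh identity costs another solve.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # hypothetical values
MAX_REQUESTS_PER_WINDOW = 30

_request_log = defaultdict(deque)   # cookie value -> recent request timestamps

def allow_request(pow_cookie: str | None) -> bool:
    """Return True if this request should be served."""
    if pow_cookie is None:
        # No solved challenge yet: send the client to the PoW interstitial instead.
        return False
    now = time.monotonic()
    log = _request_log[pow_cookie]
    # Drop timestamps that have fallen out of the window.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS_PER_WINDOW:
        return False            # this identity is crawling too fast
    log.append(now)
    return True
```

Rotating identities doesn't buy the crawler much here, because every fresh cookie costs another proof-of-work solve.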

Yes, I think you're right. The commentary about its (presumed, imagined) effectiveness very much assumes that it's designed to be an impenetrable wall[0] -- i.e. to prevent bots from accessing the content entirely.

I think TFA is generally quite good and makes a good point about the economics of the situation, but finding that the math shakes out that way should, perhaps, lead one to question their starting point / assumptions[1].

In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?

[0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.

[1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.

  • > In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?

    I agree, but the advertising is the whole issue. "Checking to see you're not a bot!" and all that.

    Therefore some people using Anubis expect it to be an impenetrable wall, to "block AI scrapers", especially those that believe it's a way for them to be excluded from training data.

It's why just a few days ago there was an HN frontpage post of someone complaining that "AI scrapers have learnt to get past Anubis".

    But that is a fight that one will never win (analog hole as the nuclear option).

    If it said something like "Wait 5 seconds, our servers are busy!", I would think that people's expectations would be more accurate.

    As a robot I'm really not that sympathetic to anti-bot language backfiring on humans. I have to look away every time it comes up on my screen. If they changed their language and advertising, I'd be more sympathetic -- it's not as if I disagree that overloading servers for not much benefit is bad!

    • Yeah, I think it's a pretty natural conclusion to draw, that {thing to hinder crawlers} ≈ {thing to stop all crawlers}. Perhaps I should have stated that explicitly in the original comment.

      As for the presentation/advertising, I didn't get into it because I don't hold a particularly strong opinion. Well, I do hold a particularly strong opinion, but not one that really distinguishes Anubis from any of the other things. I'm fully onboard with what you're saying -- I find this sort of software extremely hostile and the fact that so many people don't[0] reminds me that I'm not a people.

      In my experience, this particular jump scare is about the same as any of the other services. The website is telling me that I'm not welcome for whatever arbitrary reason it is now, and everyone involved wants me to feel bad.

      Actually, there is one thing I like about the Anubis experience[1] compared to the other ones: it doesn't "Would you like to play a game?" me. As a robot I appreciate the bluntness, I guess.

      (the games being: "click on this. now watch spinny. more. more. aw, you lose! try again?", and "wheel, traffic light, wildcard/indistinguishable"[2]).

      [0] "just ignore it, that's what I do" they say. "Oh, I don't have a problem like that. Sucks to be you."

      [1] yes, I'm talking upsides about the experience of getting **ed by it. I would ask how we got here but it's actually pretty easy to follow.

      [2] GCHQ et al. should provide a meatspace operator verification service where they just dump CCTV clips and you have to "click on the squares that contain: UNATTENDED BAG". Call it "phonebooth, handbag, foreign agent".

      (Apologies for all the weird tangents -- I'm just entertaining myself, I think I might be tired.)

You don't necessarily need JS; you just need something that can detect when Anubis is in use and complete the challenge.

  • Sure, doesn't change anything though; you still need to spend energy on a bunch of hash calculations.

  • But then you rate limit that challenge.

    You could set up a system for parallelizing the creation of these Anubis PoW cookies independently of the crawling logic. That would probably work, but it's a pretty heavy lift compared to "just run a browser with JavaScript".
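    Roughly, "complete the challenge" just means brute-forcing a nonce. A sketch of what such a browserless solver would do, assuming a generic SHA-256 leading-zero-bits proof of work (the exact Anubis challenge format may differ):

    ```python
    # Rough sketch of solving a PoW challenge without a browser, assuming a
    # generic SHA-256 leading-zero scheme; each cookie still costs this compute.
    import hashlib
    from itertools import count

    def solve_pow(challenge: str, difficulty_bits: int) -> int:
        """Find a nonce such that sha256(challenge + nonce) has at least
        `difficulty_bits` leading zero bits."""
        target = 1 << (256 - difficulty_bits)
        for nonce in count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    # e.g. solve_pow("server-issued-string", 20) takes ~2**20 hashes on average
    ```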

Well maybe, but even then, how many parallel crawls are you going to do per site? 100 maybe? You can still get enough keys to do that for all sites in just a few hours per week.
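A back-of-envelope version of that claim, with all figures being illustrative assumptions rather than measurements:

```python
# Illustrative numbers only: how long pre-minting PoW cookies might take.
sites = 10_000                 # sites behind a PoW interstitial
identities_per_site = 100      # parallel crawl identities wanted per site
seconds_per_solve = 2          # rough cost of one solve on one core
cores = 64                     # one solver box

total_solves = sites * identities_per_site
hours = total_solves * seconds_per_solve / cores / 3600
print(f"{hours:.1f} hours per refresh")   # ~8.7 hours
```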