Comment by notpushkin
4 months ago
My favourite thing about Anubis is that (in its default configuration) it completely bypasses the actual challenge if the User-Agent header is set to curl's default.
E.g. if you open this in browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you run this, you get the page content straight away:
curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better yet, put together something that’s less annoying for your visitors, like the OP did.
It only challenges user agents with Mozilla in their name, by design: anything else is already identifying itself. If Anubis makes the bots change their user agents, it has done its job, because that traffic can now be addressed directly.
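A minimal sketch of that gating rule (my own illustration, not Anubis's actual code): challenge only requests whose User-Agent contains "Mozilla", and let everything else through to be handled by other means.

```python
def should_challenge(user_agent: str) -> bool:
    """Hypothetical sketch of the default gating rule: only
    browser-like UAs (which all contain "Mozilla" for historical
    reasons) get the proof-of-work challenge."""
    return "Mozilla" in user_agent

# A browser-like UA gets the challenge.
assert should_challenge("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")
# curl's default UA passes straight through.
assert not should_challenge("curl/8.5.0")
```

A bot that changes its UA to dodge the challenge thereby stops blending in with browsers, which is the point being made above.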
This has basically been Wikipedia's bot policy for a long long time. If you run a bot you should identify it via the UserAgent.
https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...
It's only recently, within the last three months IIRC, that Wikipedia started requiring a UA header
I know because, as a matter of practice, I do not send one. As with most www sites, I used Wikipedia for many years without ever sending a UA header and never had a problem.
I read the www text-only, no graphical browser, no Javascript
What if every request from the bot has a different UA?
Success. The goal is to differentiate users and bots who are pretending to be users.
Then you can tell the bots apart from legitimate users through normal WAF rules, because browsers froze the UA a while back.
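Because real browsers now send a small, frozen set of UA shapes, a WAF-style rule can flag clients whose "Mozilla" UA doesn't match any of them. A rough sketch of the idea (the patterns here are illustrative, not a production rule set):

```python
import re

# Hypothetical allowlist of frozen, browser-shaped UA prefixes.
FROZEN_BROWSER_UA = [
    re.compile(r"^Mozilla/5\.0 \(Windows NT 10\.0; Win64; x64\)"),
    re.compile(r"^Mozilla/5\.0 \(Macintosh; Intel Mac OS X 10[._]15"),
    re.compile(r"^Mozilla/5\.0 \(X11; Linux x86_64\)"),
]

def looks_like_real_browser(ua: str) -> bool:
    """A bot rotating random 'Mozilla' strings will rarely match
    the handful of shapes real browsers actually send."""
    return any(p.match(ua) for p in FROZEN_BROWSER_UA)

assert looks_like_real_browser(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
assert not looks_like_real_browser("Mozilla/4.0 (RandomBot 1234)")
```

Randomizing the UA per request makes each request look *less* like a browser, not more, so it trips exactly this kind of rule.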
Can you explain what you mean by this? Why Mozilla specifically and not WebKit or similar?
Due to weird historical reasons [0] [1], every modern browser's User-Agent starts with "Mozilla/5.0", even if they have nothing to do with Firefox.
[0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...
[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
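For illustration, here are abridged, representative UA strings for the major browsers (versions are illustrative); all share the same prefix:

```python
# Abridged, representative UA strings; note the shared "Mozilla/5.0" prefix.
user_agents = {
    "Firefox": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Chrome": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Safari": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
}
assert all(ua.startswith("Mozilla/5.0") for ua in user_agents.values())
```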
This was a tactical decision I made in order to avoid breaking well-behaved automation that properly identifies itself. I have been mocked endlessly for it. There is no winning.
The winning condition does not need to consider people who write before they think.
How is a curl user-agent automatically a well-behaved automation?
One assumes it is a human, running curl manually, from the command line on a system they're authorized to use. It's not wget -r.
> I’m pretty sure this gets abused by AI scrapers a lot.
In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?