Comment by notpushkin
4 months ago
My favourite thing about Anubis is that (in its default configuration) it completely bypasses the actual challenge if the User-Agent header is set to curl's default.
E.g. if you open this in browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you run this, you get the page content straight away:
curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better yet, put together something that’s less annoying for your visitors, like the OP did.
It only challenges user agents with Mozilla in their name, by design: anything else is already identifying itself. If Anubis makes the bots change their user agents, it has done its job, because that traffic can now be addressed directly.
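A minimal sketch of that gating rule (my own illustration, not Anubis's actual code): challenge only requests whose User-Agent contains "Mozilla", and let everything else through to be handled by other means.

```python
def should_challenge(user_agent: str) -> bool:
    """Hypothetical sketch of the default gating rule: only
    browser-like UAs (which all contain "Mozilla" for historical
    reasons) get the proof-of-work challenge."""
    return "Mozilla" in user_agent

# A browser-like UA gets the challenge.
assert should_challenge("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")
# curl's default UA passes straight through.
assert not should_challenge("curl/8.5.0")
```

A bot that changes its UA to dodge the challenge thereby stops blending in with browsers, which is the point being made above.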
This has basically been Wikipedia's bot policy for a long long time. If you run a bot you should identify it via the UserAgent.
https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...
It's only recently, within the last three months IIRC, that Wikipedia started requiring a UA header
I know because, as a matter of practice, I do not send one. As with most www sites, I used Wikipedia for many years without ever sending a UA header and never had a problem.
I read the www text-only, no graphical browser, no Javascript
What if every request from the bot has a different UA?
Success. The goal is to differentiate users and bots who are pretending to be users.
Then you can tell the bots apart from legitimate users through normal WAF rules, because browsers froze the UA a while back.
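Because real browsers now send a small, frozen set of UA shapes, a WAF-style rule can flag clients whose "Mozilla" UA doesn't match any of them. A rough sketch of the idea (the patterns here are illustrative, not a production rule set):

```python
import re

# Hypothetical allowlist of frozen, browser-shaped UA prefixes.
FROZEN_BROWSER_UA = [
    re.compile(r"^Mozilla/5\.0 \(Windows NT 10\.0; Win64; x64\)"),
    re.compile(r"^Mozilla/5\.0 \(Macintosh; Intel Mac OS X 10[._]15"),
    re.compile(r"^Mozilla/5\.0 \(X11; Linux x86_64\)"),
]

def looks_like_real_browser(ua: str) -> bool:
    """A bot rotating random 'Mozilla' strings will rarely match
    the handful of shapes real browsers actually send."""
    return any(p.match(ua) for p in FROZEN_BROWSER_UA)

assert looks_like_real_browser(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
assert not looks_like_real_browser("Mozilla/4.0 (RandomBot 1234)")
```

Randomizing the UA per request makes each request look *less* like a browser, not more, so it trips exactly this kind of rule.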
Can you explain what you mean by this? Why Mozilla specifically and not WebKit or similar?
Due to weird historical reasons [0] [1], every modern browser's User-Agent starts with "Mozilla/5.0", even if they have nothing to do with Firefox.
[0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...
[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
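For illustration, here are abridged, representative UA strings for the major browsers (versions are illustrative); all share the same prefix:

```python
# Abridged, representative UA strings; note the shared "Mozilla/5.0" prefix.
user_agents = {
    "Firefox": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Chrome": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Safari": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
}
assert all(ua.startswith("Mozilla/5.0") for ua in user_agents.values())
```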
This was a tactical decision I made in order to avoid breaking well-behaved automation that properly identifies itself. I have been mocked endlessly for it. There is no winning.
The winning condition does not need to consider people who write before they think.
How is a curl user-agent automatically a well-behaved automation?
One assumes it is a human, running curl manually, from the command line on a system they're authorized to use. It's not wget -r.
> I’m pretty sure this gets abused by AI scrapers a lot.
In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?