
Comment by hedora

3 days ago

This was obviously dumb when it launched:

1) scrapers just run a full browser and wait for the page to stabilize (see the sketch below). They did this before this thing launched, so it probably never worked.

2) The AI reading the page needs something like 5 seconds * 1600W to process it. Assuming my phone can even perform that much compute as efficiently as a server-class machine, it’d take a large multiple of five seconds to do it, and get stupid hot in the process.

Note that (2) holds even if the AI is doing something smart like batch processing 10-ish articles at once.
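
For (1), something like this is all it takes; a minimal sketch assuming Playwright's Python bindings, with a placeholder URL:

```python
# Minimal sketch of the "run a full browser and wait" approach from (1).
# Assumes Playwright's Python bindings; the URL is a placeholder.
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # "networkidle" waits until the page has stopped making requests,
        # i.e. any client-side challenge or rendering has had time to settle.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

if __name__ == "__main__":
    print(fetch_rendered("https://example.org/some-article"))
```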

> This was obviously dumb when it launched:

Yes. Obviously dumb but also nearly 100% successful at the current point in time.

And likely going to stay successful as the non-protected internet still provides enough information to dumb crawlers that it’s not financially worth it to even vibe-code a workaround.

Or in other words: Anubis may be dumb, but the average crawler that completely exhausts some sites’ resources is even dumber.

And so it all works out.

And so the question remains: how dumb was it exactly, when it works so well and continues to work so well?

  • > Yes. Obviously dumb but also nearly 100% successful at the current point in time.

    Only if you don't care about negatively affecting real users.

    • I understand this as an argument that it’s better to be down for everyone than have a minority of users switch browsers.

      I’m not convinced that makes sense.

      Now ideally you would have the resources to serve all users and all the AI bots without performance degradation, but for some projects that’s not feasible.

      In the end it’s all a compromise.

  • does it work well? I run Chromium controlled by Playwright for scraping and typically make Gemini implement the script for it because it's not worth my time otherwise. But I'm not crawling the Internet generally (which I think there is very little financial incentive to do; it's a very expensive process even ignoring Anubis et al); it's always that I want something specific and am sufficiently annoyed by the lack of an API.

    regarding authentication mentioned elsewhere, passing cookies is no big deal.

    • Anubis is not meant to stop single endpoints from scraping. It's meant to make it harder for massive AI scrapers. The problematic ones evade rate limiting by using many different IP addresses, and make scraping cheaper on themselves by running headless. Anubis is specifically built to make that kind of scraping harder, as I understand it.

  • the workaround is literally just running a headless browser, and that's pretty much the default nowadays.

    if you want to save some $$$ you can spend like 30 minutes making a cracker like in the article (sketched below). just make it multi-threaded, add a queue and boom, your scraper nodes can go back to their cheap configuration. or since these are AI orgs we’re talking about, write a GPU cracker and laugh as it solves challenges far faster than any user could.

    custom solutions aren't worth it for individual sites, but with how widespread anubis is it's become worth it.
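
    roughly what i mean, assuming an anubis-style sha-256 challenge (find a nonce so the hex digest starts with N zeros; the challenge string and difficulty are made up). processes instead of threads here, since python's GIL would serialize the hashing:

    ```python
    # rough sketch: parallel solver for an anubis-style proof-of-work
    # challenge. the challenge string and difficulty are made-up examples.
    import hashlib
    from multiprocessing import Pool

    def solve_range(args):
        challenge, difficulty, start, step = args
        prefix = "0" * difficulty
        nonce = start
        while True:
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith(prefix):
                return nonce, digest
            nonce += step  # each worker strides over its own slice of nonces

    def solve(challenge: str, difficulty: int, workers: int = 8):
        jobs = [(challenge, difficulty, i, workers) for i in range(workers)]
        with Pool(workers) as pool:
            # first worker to find a valid nonce wins; leaving the `with`
            # block terminates the rest
            for result in pool.imap_unordered(solve_range, jobs):
                return result

    if __name__ == "__main__":
        print(solve("made-up-challenge-string", difficulty=4))
    ```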

I agree. Your estimate for (2), about 0.0022 kWh, corresponds to about a sixth of the charge of an iPhone 15 Pro and would take longer than ten minutes on the phone, even at max power draw. It feels about right for the energy/compute a large modern MoE needs to ingest pages of several tens of thousands of tokens. For example, this tech (a couple of months old) could feed 52.3k input tokens per second to a 672B-parameter model per H100 node instance, which probably burns about 6–8 kW while doing it. The new B200s should be about 2x to 3x more energy efficient, but your point still holds within an order of magnitude.

https://lmsys.org/blog/2025-05-05-large-scale-ep/
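
For anyone checking the arithmetic, roughly (the battery capacity and sustained phone power draw below are my own rough assumptions):

```python
# back-of-the-envelope check; battery capacity and sustained phone
# power draw are rough assumptions, not figures from the thread
energy_j = 5 * 1600                    # 5 s at 1600 W = 8000 J
energy_kwh = energy_j / 3.6e6          # ~0.0022 kWh
iphone_15_pro_wh = 12.7                # ~3274 mAh at 3.87 V (assumed)
fraction_of_charge = energy_kwh * 1000 / iphone_15_pro_wh  # ~0.17, about a sixth
minutes_on_phone = energy_j / 13 / 60  # ~10 min at ~13 W sustained draw (assumed)
print(f"{energy_kwh:.4f} kWh, {fraction_of_charge:.2f} of a charge, {minutes_on_phone:.1f} min")
```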

The argument doesn't quite hold. The mass scraping (for training) is almost never done by a GPU system; it's almost always done by a dedicated system running a full Chrome fork in some automated way (not just the signatures but some bugs give that away).

And frankly, processing a single page of text fits within a single token window, so it likely runs for a blink (milliseconds) before moving on to the next data entry. The kicker is that it can be run thousands of times over, depending on your training strategy.

At inference time there's now a dedicated tool that may perform a "live" request to scrape the site's contents. But then this is just pushed into a massive context window to produce the next token anyway.

  • The point is that scraping is already inherently cost-intensive so a small additional cost from having to solve a challenge is not going to make a dent in the equation. It doesn't matter what server is doing what for that.

    • 100 billion web pages * 0.02 USD of PoW per page = 2 billion dollars (see the quick sum below). The point is not to stop every scraper/crawler; the point is to raise the costs enough to avoid being bombarded by all of them.
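
      The quick sum, with the 0.02 USD of PoW per page as an assumed figure rather than a measured Anubis cost:

      ```python
      # the parent's scaling argument; the per-page PoW cost is an assumption
      pages = 100e9              # 100 billion pages in a full crawl
      pow_cost_per_page = 0.02   # USD of compute per challenge (assumed)
      print(f"${pages * pow_cost_per_page:,.0f}")  # $2,000,000,000
      ```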
