Comment by rob_c

2 days ago

The argument doesn't quite hold. The mass scraping (for training) is almost never done by a GPU system; it's almost always done by a dedicated system running a full Chrome fork in some automated way (not just the signatures but some of the bugs give that away).

And frankly, processing a single page of text fits within a single token window, so it runs for a blink (milliseconds) before the pipeline moves on to the next data entry. The kicker is that it may be run thousands of times over, depending on your training strategy.

At inference there's now a dedicated tool that may perform a "live" request to scrape the site contents, but that result is just pushed into a massive context window to produce the next token anyway.

The point is that scraping is already inherently cost-intensive, so a small additional cost from having to solve a challenge is not going to make a dent in the equation. It doesn't matter which server is doing what.
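To make that concrete, here is a rough back-of-envelope sketch. Every number in it (headless-Chrome CPU time per page, cloud CPU pricing, PoW difficulty, pages per cookie) is an illustrative assumption, not a measurement; the only point is the ratio between what a crawler already pays per page and what the challenge adds.

```python
# Back-of-envelope sketch with assumed (not measured) numbers, comparing the
# marginal cost a crawler already pays to fetch and render a page against the
# extra cost a proof-of-work challenge adds per page.

CPU_SECOND_USD = 0.04 / 3600   # assumed ~$0.04 per vCPU-hour of cloud compute

render_cpu_seconds = 1.5       # assumed headless-Chrome cost to fetch + render one page
pow_cpu_seconds = 1.0          # assumed PoW difficulty tuned so real users wait ~1 s
pages_per_cookie = 1000        # assumed pages fetched before the cookie is re-challenged

scrape_cost_per_page = render_cpu_seconds * CPU_SECOND_USD
pow_cost_per_page = (pow_cpu_seconds * CPU_SECOND_USD) / pages_per_cookie

print(f"scraping cost per page: ${scrape_cost_per_page:.2e}")
print(f"PoW overhead per page:  ${pow_cost_per_page:.2e}")
print(f"PoW adds ~{100 * pow_cost_per_page / scrape_cost_per_page:.3f}% on top of the existing cost")
```

Under these assumptions the challenge adds well under a tenth of a percent to what the scraper is already spending per page.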

  • 100 billion web pages × $0.02 of PoW per page = $2 billion. The point is not to stop every scraper/crawler; the point is to raise the costs enough to avoid being bombarded by all of them.

    • Yes, but it's not going to be $0.02 of PoW per page! That is an absurd number. It'd mean a two-hour proof of work on a server CPU and a ten-hour proof of work on a phone.

      In reality you can afford maybe 1/10,000th of that before the latency hit to real users becomes unacceptable.

      And then, the cost is not per page. The cost is per cookie. Even if the cookie is rate-limited, you could easily use it for 1000 downloads.

      Those two errors are multiplicative, so your numbers are probably off by about seven orders of magnitude. The cost of the PoW is not going to be $2B, but about $200 (see the sketch after this thread).

    • I'm going to phrase the explanation like this in the future. Couldn't have said it better myself.
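For anyone who wants to check that correction, here is the same arithmetic written out. The inputs are just the figures quoted in this thread; the 1/10,000 difficulty factor and the 1,000-downloads-per-cookie reuse are the commenter's estimates, not measured values.

```python
# Redoing the thread's estimate with the two corrections applied
# (all inputs are the figures quoted in the comments above).

pages = 100e9                 # 100 billion pages, as in the original estimate
claimed_pow_per_page = 0.02   # USD of PoW per page assumed upstream

realistic_difficulty = 1 / 10_000   # PoW ~10,000x cheaper to keep real-user latency acceptable
pages_per_cookie = 1000             # one solved challenge (cookie) reused for ~1,000 downloads

naive_total = pages * claimed_pow_per_page
corrected_total = naive_total * realistic_difficulty / pages_per_cookie

print(f"naive estimate:     ${naive_total:,.0f}")      # $2,000,000,000
print(f"corrected estimate: ${corrected_total:,.0f}")  # ~$200
```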