How does archive.today work? How do they get a paywall-free version of web pages?

1 year ago

I would guess they pretend to be a search engine crawler; crawlers are usually allowed to see paywalled pages. There are ways to reliably detect fake crawlers, but few paywalled sites seem to bother doing that.
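For the curious, the usual way to detect a fake crawler is a forward-confirmed reverse DNS check: look up the PTR record for the client IP, check that the hostname belongs to the search engine, then resolve that hostname and confirm it maps back to the same IP. A minimal sketch, assuming Google's crawler domains and IPv4 only; the function names are made up for illustration, and a real site would also cache results:

```python
import socket

# Hostname suffixes Google documents for its crawlers.
# Other search engines publish their own; this list is an assumption.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_genuine_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a client claiming to be a crawler."""
    hostname = reverse(ip)
    if not hostname or not hostname.lower().endswith(TRUSTED_SUFFIXES):
        return False
    # The PTR record alone can be forged by whoever controls the IP's
    # reverse zone, so confirm the hostname resolves back to the same IP.
    return ip in forward(hostname)
```

A site would call `is_genuine_crawler(client_ip)` only for requests whose User-Agent claims to be a crawler, since the DNS lookups are slow. The `reverse` and `forward` parameters exist so the check can be exercised without network access.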

  • They are also running the JavaScript at archive time: "Saved pages will have no active elements and no scripts, so they keep you safe as they cannot have any popups or malware!"

    As to how they (1) survive and (2) don't face technical countermeasures, I'll say:

    (A) Rumor has it they're based in Russia

    (B) Just to do their job (getting the facts straight), any journalist from any publication has to be able to read every article from every publication. Word got out that you could log into the WSJ and many other publications with "media/media" [1]. Given the 5,000+ newspapers in the US, any kind of mutual pact to give each other access is a difficult problem. (E.g., is the New York Times going to step down from its throne to buy access to a huge number of no-status small-town papers, even if any of them could be the authoritative source for a huge story?)

    If there weren't an archive.today, they'd have to make one.

    [1] https://www.inc.com/bill-murphy-jr/free-login-wall-street-jo...

  • Doesn't seem like it, since it apparently sends a user agent for an old version of Chrome on Windows, not a crawler user agent:

    https://archive.ph/4xptU

    I guess it could have a whitelist of paywalled sites that it pretends to be a crawler for, and just use the Chrome UA for everything else.
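
To make the whitelist idea in the last reply concrete, here is a rough sketch of per-site user agent selection. This is pure speculation about the mechanism, not archive.today's actual code: the host list, the browser UA string, and the function name are all hypothetical, and only the Googlebot string follows Google's published format.

```python
from urllib.parse import urlsplit

# Googlebot's published desktop UA format.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Illustrative stand-in for "an old version of Chrome on Windows".
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
)

# Hypothetical whitelist of sites that get the crawler UA.
CRAWLER_HOSTS = {"paywalled.example"}


def pick_user_agent(url):
    """Return the crawler UA for whitelisted hosts, the browser UA otherwise."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Match the host itself or any parent domain, so that
    # www.paywalled.example is covered by paywalled.example.
    candidates = {".".join(parts[i:]) for i in range(len(parts))}
    return CRAWLER_UA if candidates & CRAWLER_HOSTS else BROWSER_UA
```

Under this scheme a test URL on an unlisted site would only ever see the Chrome UA, which would be consistent with the archived page linked above without ruling out the crawler guess.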