Comment by xurukefi
2 days ago
Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.
> I figured that they have found an (automated) way to imitate Googlebot really well.
If a site (or the WAF in front of it) knows what it's doing, then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
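For reference, that dance looks roughly like this (a minimal sketch of the documented verification approach, not any particular WAF's code): reverse-resolve the client IP, check that the PTR name falls under googlebot.com or google.com, then forward-resolve that name and confirm it maps back to the same IP.

```python
import socket

def is_verified_googlebot(client_ip: str) -> bool:
    """Reverse DNS -> domain check -> forward DNS round-trip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)        # PTR lookup
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]      # A lookup
        return client_ip in forward_ips                         # must round-trip
    except (socket.herror, socket.gaierror):
        return False

# A request from a random residential or datacenter IP with a spoofed
# "Googlebot" User-Agent fails at the very first step, regardless of headers.
```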
There are ways to work around this. I've just tested this: I used the URL inspection tool of Google Search Console to fetch a URL from my website, which I've configured to redirect to a paywalled news article. It turns out the crawler follows that redirect and gives me the full source code of the redirected website, without any paywall.
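The redirect half of that setup could be as simple as the sketch below (hypothetical; the path and target URL are placeholders, and the actual fetch is triggered from the URL inspection tool itself, not from code):

```python
# A URL on your own (Search Console verified) site that 302-redirects to the
# paywalled article, so the inspection crawl follows it and renders the target.
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGET = "https://news.example.com/some-paywalled-article"  # placeholder

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/inspect-me":
            self.send_response(302)
            self.send_header("Location", TARGET)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()
```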
That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along the lines of this. It's a perfect imitation of Googlebot because it is literally Googlebot.
I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P
Presumably they are just matching on *Google* and calling it a day.
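i.e. something like this (a guess at the kind of check involved; user-agent strings abbreviated):

```python
import re

# Sloppy check: anything containing "google" gets the paywall-free version,
# which also matches Google-InspectionTool, AdsBot-Google, etc.
GOOGLE_RE = re.compile(r"google", re.IGNORECASE)

def serve_without_paywall(user_agent: str) -> bool:
    return bool(GOOGLE_RE.search(user_agent))

assert serve_without_paywall("Mozilla/5.0 (compatible; Googlebot/2.1; ...)")
assert serve_without_paywall("Mozilla/5.0 (compatible; Google-InspectionTool/1.0; ...)")
```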
> which I've configured to redirect to a paywalled news article.
Which specific site with a paywall?
> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.
The curious part is that they allow web scraping of arbitrary pages on demand. So a publisher could put in a lot of requests to archive their own pages and check whether they all come from a single account or a small subset of accounts.
I hope they haven't been stealing cookies from actual users through a botnet or something.
Exactly. If I was an admin of a popular news website I would try to archive some articles and look at the access logs in the backend. This cannot be too hard to figure out.
You don't even need active measures. If a publisher is serious about tracing traitors there are algorithms for that (which are used by streamers to trace pirates). It's called "Traitor Tracing" in the literature. The idea is to embed watermarks following a specific pattern that would point to a traitor or even a coalition of traitors acting in concert.
It would be challenging to do with text, but is certainly doable with images - and articles contain those.
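A toy version of the idea, just to make it concrete (real traitor-tracing schemes such as Tardos fingerprinting codes are designed to survive collusion and lossy re-encoding; this sketch is not, and the account IDs are invented):

```python
import hashlib
from typing import List, Optional

NUM_MARK_SLOTS = 32  # e.g. 32 images/positions where a 0/1 mark can be hidden

def codeword(account_id: str) -> List[int]:
    """Deterministic per-account bit pattern to embed across the content."""
    digest = hashlib.sha256(account_id.encode()).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(NUM_MARK_SLOTS)]

def identify_leaker(recovered_bits: List[int], accounts: List[str]) -> Optional[str]:
    """Match the watermark recovered from a leaked copy against known accounts."""
    for account in accounts:
        if codeword(account) == recovered_bits:
            return account
    return None

accounts = ["sub-0001", "sub-0002", "sub-0003"]   # hypothetical subscriber IDs
leaked_pattern = codeword("sub-0002")             # bits read back off the leaked copy
print(identify_leaker(leaked_pattern, accounts))  # -> sub-0002
```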
You need that sort of thing (i.e. watermarking) when people are intentionally trying to hide who did it.
In the archive.today case, it looks pretty automated. Surely just adding an HTML comment would be sufficient.
> which is, of course, ridiculous.
Why? In the world of web scraping this is pretty common.
Because it works too reliably. Imagine what that would entail: managing thousands of accounts. You would need to strip the account details from archived pages perfectly. Every time the website changes its code even slightly, you are at risk of losing one of your accounts. It would constantly break and would be an absolute nightmare to maintain. I've personally never encountered such a failure on a paywalled news article; archive.today has managed to give me a non-paywalled, clean version every single time.
Maybe they use accounts for some special sites. But there is definitely some automated generic magic happening that manages to bypass the paywalls of news outlets. Probably something Googlebot-related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.
Using two or more accounts could help you automatically strip account details.
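In sketch form (assuming the archiver can fetch the same URL under two different logins; the fetching itself is omitted): diff the two renderings and redact whatever differs, since account-specific strings like usernames, emails and tracking IDs can only appear in the differing parts.

```python
import difflib

def strip_account_details(page_a: str, page_b: str,
                          placeholder: str = "<!-- redacted -->") -> str:
    """Keep lines identical in both fetches; replace divergent spans."""
    lines_a = page_a.splitlines()
    lines_b = page_b.splitlines()
    out = []
    for tag, i1, i2, _, _ in difflib.SequenceMatcher(None, lines_a, lines_b).get_opcodes():
        if tag == "equal":
            out.extend(lines_a[i1:i2])
        else:
            out.append(placeholder)   # username, email, subscriber ID, etc.
    return "\n".join(out)
```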
Do you know where the doxxed info ultimately originates from? It turns out that the archives leaked account names. Try Googling what happened to volth on GitHub.
Replace any identifiers like usernames and emails with another string automatically.
I could be wrong, but I think I've seen it fail on more obscure sites. But yeah it seems unlikely they're maintaining so many premium accounts. On the other hand they could simply be state-backed. Let's say there are 1000 likely paywalled sites, 20 accounts for each = 20k accounts, $10/month => $200k/month = $2.4m a year. If I were an intelligence agency I'd happily drop that plus costs to own half the archived content on the internet.
Surely it wouldn't be too hard to test. Just set up an unlisted dummy paywall site, archive it a few times, and see what the requests look like.
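Something along these lines (a throwaway sketch; every name here is made up): serve a fake "subscriber-only" page, submit its URL to archive.today a few times, and see which IPs, user agents and cookies show up in the log.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoggingPaywall(BaseHTTPRequestHandler):
    def do_GET(self):
        # Log who is actually fetching the page.
        print(f"{self.client_address[0]} | "
              f"UA={self.headers.get('User-Agent')} | "
              f"Cookie={self.headers.get('Cookie')}")
        body = b"<html><body><p>Subscribe to read this article.</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), LoggingPaywall).serve_forever()
```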
I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.
... but what about subscription-only, paywalled sources?
Many publishers offer "first one's free".
For those that don't, I would guess archive.today is using malware to piggyback off of subscriptions.
It's because it's actively maintained, and bypassing paywalls is its whole selling point, so they have to be good at it.
They bypass the rendering issues by "altering" the webpages. It's not uncommon to archive a page and see nothing because of the paywall; then, later on, the same page is silently fixed. They have a Tumblr where you can ask them questions; at one point it was quite common for people to ask them to fix random specific pages, which they did promptly.
Honestly, you cannot archive a modern page unless you alter it. Yet they're now being attacked under the pretence of "altering" webpages, but that's never been a secret, and it's technologically impossible to archive without altering.
There's a pretty massive difference between altering a snapshot to make it archivable/readable and doing it to smear and defame a blogger who wrote about you.
I imagine accounts are the only way that archive.today works on sites like 404media.co that seem to have server-side paywalls. Similarly, Twitter has a completely server-side paywall.
It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.
But it is reliable in the sense that if it works for a site, then it usually never fails.
No tool is 100% effective. Archive.today is the best one we've seen.