Comment by wuschel
6 days ago
There is a post describing the possibility of an organised campaign against archive.today [1] https://algustionesa.com/the-takedown-campaign-against-archi...
How does the tech behind archive.today work in detail? Is there any information out there that goes beyond the Google AI search reply or this HN thread [2]?
[1] https://algustionesa.com/the-takedown-campaign-against-archi... [2] https://news.ycombinator.com/item?id=42816427
If they're under an organised defamation campaign, they're not helping themselves by DDoSing someone else's blog and editing archived pages.
Is that, itself, true or disinformation?
They did edit archived pages. They temporarily did a find/replace on their archive to replace "Nora Puchreiner" (an alias the site operator uses) with "Jani Patokallio" (the name of the blogger who wrote about archive.today's owner). https://megalodon.jp/2026-0219-1634-10/https://archive.ph:44...
They also tampered with their archive for a few of the social media sites (Twitter, Instagram, Blogger) by changing the name of the signed in account to Jani Patokallio. https://megalodon.jp/2026-0220-0320-05/https://archive.is:44...
I think Wikipedia made the right decision: you can't trust an archival service for citations if the sysop tampers with the database every time they get into a row.
I've not seen any evidence of them editing archived pages, BUT the DDoSing of gyrovague.com is true and still actively taking place. The author of that blog is Finnish, which has led archive.today to ban all Finnish IPs by giving them endless captcha loops. After solving the first captcha, the page reloads and a JavaScript snippet appears in the source that attempts to spam gyrovague.com with repeated fetches.
It was true and visible when reported, yeah.
I've also noticed archive.today injecting suspicious-looking ads into archived pages that originally did not have ads.
It's true.
https://archive-is.tumblr.com/post/808911640210866176/people...
it gives them a voice.
And that voice is practically shouting, "I AM UNTRUSTWORTHY".
archive.today works surprisingly well for me, often succeeding where archive.org fails.
archive.org also complies with takedown requests, so it's worth asking: could the organised campaign against archive.today have something to do with it preserving content that someone wants removed?
They preserve a lot of paywalled content, so yeah, I'm sure there are enough financial incentives to bother them :(
[dead]
There was also the recent news about sites beginning to block the Internet Archive. Feels like we are gearing up for the next phase of the information war.
[flagged]
Was that written by AI? It sounds like AI, spends lots of time summarizing other posts, and has no listed author. My AI alarm is going off.
Ars was caught recently using AI to write articles when the AI hallucinated about a blogger getting harassed by someone using AI agents. The article quoted his blog and all the quotes were nonsense.
Even if something is AI generated, the author and the editor should at least attempt to read back the article. English isn't my native language, so that obviously plays in, but very frequently I find that the articles I struggle to read are AI generated; they certainly have that AI feel.
It would be interesting to run the numbers, but I get the feeling that AI generated articles may have a higher LIX number. Authors are then less inclined to "fix" the text, because longer words make them seem smarter.
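For anyone who wants to run those numbers, here is a minimal Python sketch of the LIX readability score: average sentence length plus the percentage of long words (more than six letters). The function name `lix` and the sentence splitting on `.`, `!`, `?` and `:` are my own simplifications, so treat the results as rough rather than as a reference implementation.

```python
import re


def lix(text: str, long_word_len: int = 7) -> float:
    """Rough LIX score: words per sentence + percentage of long words."""
    # Words are runs of letters (digits and underscores are ignored).
    words = re.findall(r"[^\W\d_]+", text)
    # Simplified sentence split; real tools handle abbreviations etc.
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # A "long" word has more than six letters, i.e. at least seven.
    long_words = sum(1 for w in words if len(w) >= long_word_len)
    return len(words) / len(sentences) + 100 * long_words / len(words)


if __name__ == "__main__":
    print(lix("The cat sat. The dog ran."))
    print(lix("Residential botnets are uncommon."))
```

Comparing the score of a suspected article against a few known human-written pieces from the same outlet would say more than any single number.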
Yeah, wow. Definitely setting off my AI summary alarm.
Yeah nearly certainly.
A big fear of mine is something happening to archive.is
There is so much archived there; to lose it all would be a tragedy.
There are a number of blog posts like
owner-archive-today . blogspot . com
2 years old, like J.P's first post on AT
They are able to scrape paywalled sites at random, so I'm guessing a residential botnet is used.
It's funny that residential VPN botnets aren't uncommon now. "Free VPN" if you allow your computer/phone to be an exit point.
But how do they bypass the paywall? They can't just pretend to be Google by changing the user-agent, this wouldn't work all the time, as some websites also check IPs, and others don't even show the full content to Google.
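To illustrate why a spoofed user-agent alone often fails, here is a minimal Python sketch of the kind of check a site could run: reverse-resolve the client IP, require a Google-owned hostname, then forward-resolve that hostname and confirm it maps back to the same IP. The function name, the suffix list and the injectable lookup parameters are assumptions for illustration; I don't know what any particular publisher actually does.

```python
import socket

# Hostname suffixes assumed to belong to Google's crawler (illustrative).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(user_agent, ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if the UA claims Googlebot AND the IP checks out."""
    if "Googlebot" not in user_agent:
        return False
    # Default to real DNS lookups; tests can inject fakes instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror / socket.gaierror) count as unverified.
        return False
```

A request from a residential IP with a Googlebot user-agent fails the reverse lookup step here, which matches the point that changing the user-agent "wouldn't work all the time".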
They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
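The "keep only what is identical between two subscriptions" idea can be sketched in a few lines of Python with the standard `difflib` module. This is purely a guess at how such a filter might look, not a description of what archive.today does; `common_lines` is a hypothetical helper and the comparison is line-based for simplicity.

```python
from difflib import SequenceMatcher


def common_lines(snapshot_a: str, snapshot_b: str) -> str:
    """Keep only the lines that appear, in order, in both snapshots."""
    a = snapshot_a.splitlines()
    b = snapshot_b.splitlines()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    kept = []
    # Each matching block is a run of lines identical in both inputs;
    # the final block is a zero-length sentinel and adds nothing.
    for block in matcher.get_matching_blocks():
        kept.extend(a[block.a:block.a + block.size])
    return "\n".join(kept)


if __name__ == "__main__":
    page_a = "<h1>Title</h1>\n<span>alice@example.com</span>\n<p>Body</p>"
    page_b = "<h1>Title</h1>\n<span>bob@example.com</span>\n<p>Body</p>"
    print(common_lines(page_a, page_b))
```

As said above, this wouldn't be foolproof: anything that differs for reasons unrelated to the account (timestamps, rotating ads, A/B tests) gets dropped too, and an identifier placed on a line that otherwise matches would need finer-grained diffing.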
On the network layer, I don't know. But on the WWW layer, archive.today operates accounts that are used to log into websites when they are snapshotted. IIRC, archive.today manipulates the snapshots to hide the fact that someone is logged in, but sometimes fails miserably:
https://megalodon.jp/2026-0221-0304-51/https://d914s229qk4kj...
https://archive.is/Y7z4E
The second shows volth's GitHub notifications. Volth was a major nix-pkgs contributor, but his GitHub account disappeared.
https://github.com/orgs/community/discussions/58164
There are some pretty robust browser addons for bypassing article paywalls, notably https://gitflic.ru/project/magnolia1234/bypass-paywalls-fire...
This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news website you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use proper authenticated content pages, but these sorts of pages don't get picked up by archive.today either.
My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.
I thought saved pages sometimes do contain users' IPs?
https://www.reddit.com/r/Advice/comments/5rbla4/comment/dd5x...
The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
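A minimal Python sketch of that leak, assuming a hypothetical publisher backend that trusts X-Forwarded-For and echoes the client IP into the HTML it serves. The function names and the HTML comment are invented for illustration; the IPs are from the documentation ranges.

```python
from html import escape


def client_ip_from_headers(headers: dict, peer_ip: str) -> str:
    """Pick the 'client' IP the way many backends behind proxies do."""
    # X-Forwarded-For is a comma-separated list; the leftmost entry is
    # the original client as reported by the first proxy.
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or peer_ip


def render_article(headers: dict, peer_ip: str) -> str:
    """Hypothetical page template that embeds the client IP in the markup."""
    ip = client_ip_from_headers(headers, peer_ip)
    return (
        "<html><body><p>Article text</p>"
        f"<!-- served to {escape(ip)} --></body></html>"
    )


if __name__ == "__main__":
    # The archiver connects from 198.51.100.20 but forwards the IP of
    # the person who requested the snapshot.
    page = render_article({"X-Forwarded-For": "203.0.113.7"}, "198.51.100.20")
    print(page)
```

Whatever the server renders is what gets snapshotted, so the requester's IP would end up in the stored copy and be visible to anyone who views its source.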
> But how do they bypass the paywall?
I'm guessing by using a residential botnet and the existing credentials of unknowing "victims", by automating their browsers.
> Otherwise, the saved page would contain information about the logged-in user.
If you read this article, there's plenty of evidence they are manipulating the scraped data.
But I'm just speculating here…