Wikipedia deprecates Archive.today, starts removing archive links

10 hours ago (arstechnica.com)

Related:

Archive.today is directing a DDoS attack against my blog - https://news.ycombinator.com/item?id=46624740 - Jan 2026 (69 comments)

There is a post describing the possibility of an organised campaign against archive.today [1] https://algustionesa.com/the-takedown-campaign-against-archi...

How does the tech behind archive.today work in detail? Is there any information out there that goes beyond the Google AI search reply or this HN thread [2]?

[1] https://algustionesa.com/the-takedown-campaign-against-archi... [2] https://news.ycombinator.com/item?id=42816427

  • archive.today works surprisingly well for me, often succeeding where archive.org fails.

    archive.org also complies with takedown requests, so it's worth asking: could the organised campaign against archive.today have something to do with it preserving content that someone wants removed?

  • There was also the recent news about sites beginning to block the Internet Archive. Feels like we are gearing up for the next phase of the information war.

    • I said it and it's true. The fake news is lying and saying it isn't true. Someone posted on my behalf saying it was true, but they were wrong. I never said it's true, and anyone who says otherwise is lying. Nobody ever said anything like that and you're crazy for even thinking it.

  • Was that written by AI? It sounds like AI, spends lots of time summarizing other posts, and has no listed author. My AI alarm is going off.

  • There are a number of blog posts like

    owner-archive-today . blogspot . com

    2 years old, like J.P's first post on AT

I don't see the point in doxing anyone, especially those providing a useful service for the average internet user. Just because you can put some info together, it doesn't mean you should.

With this said, I also disagree with turning everyone that uses archive[.]today into a botnet that DDoS sites. Changing the content of archived pages also raises questions about the authenticity of what we're reading.

The site behaves as if it was infected by some malware and the archived pages can't be trusted. I can see why Wikipedia made this decision.

  • For a very brief time, "doxing" (that is, dropping dox, that is, dropping docs, or documents) used to mean something useful. You gathered information that was not out in public, for example by talking to people or by stealing it, and put it out in the open.

    It's very silly to talk about doxing when all someone has done is gather information anyone else can equally easily obtain, just given enough patience and time, especially when it's information the person in question put out there themselves. If it doesn't take any special skills or connections to obtain the information, but only the inclination to actually perform the research on publicly available data, I don't see what has been done that is unethical.

    • Call it stalking or harassment if you prefer. Regardless, it's rude (sometimes illegal) behaviour.

      That's no justification for using visitors to your site to do a DDOS.

      In the slang of reddit: ESH

  • It's also kind of ironic that a site whose whole premise is to preserve pages forever, whether the people involved like it or not, is seeking to take down another site because they are involved and don't like it. Live by the sword, etc.

    • > It's also kind of ironic that a site whose whole premise is to preserve pages forever, whether the people involved like it or not

      Oddly, I think archive.today has explicitly said that's not what they're there for, and that people shouldn't rely on their links as a long-term archive.

  • Sites that exist to archive other websites will almost always need to dynamically change the content of the HTML that they're serving in some way or another. (For example, a link that points to the root of the website may need to be changed in order to point to the right location.)

    So it doesn't necessarily raise questions about whether the content has been changed or not. The difference is in whether that change is there to make the archive usable - and of course, for archive.today, that's not the case.
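For illustration, the kind of link rewriting described above can be sketched in a few lines of Python. This is a minimal sketch, not how archive.today or any other archive actually does it: the `archive.example` prefix, the timestamp, and the function names are all hypothetical, and a real archiver would use a proper HTML parser rather than a regex.

```python
import re
from urllib.parse import urljoin

# Hypothetical archive URL scheme: prefix + original absolute URL.
ARCHIVE_PREFIX = "https://archive.example/20240101000000/"

# Naive matcher for double-quoted href attributes (illustration only).
HREF_RE = re.compile(r'href="([^"]*)"')


def archived_url(original_page_url: str, href: str) -> str:
    """Resolve href against the original page, then point it at the archive."""
    absolute = urljoin(original_page_url, href)
    return ARCHIVE_PREFIX + absolute


def rewrite_links(html: str, original_page_url: str) -> str:
    """Rewrite every href in the captured HTML so it stays inside the archive."""
    def replace(match: re.Match) -> str:
        return 'href="{}"'.format(archived_url(original_page_url, match.group(1)))

    return HREF_RE.sub(replace, html)


if __name__ == "__main__":
    page = "https://news.example/story/page1"
    print(rewrite_links('<a href="/">Home</a> <a href="page2">Next</a>', page))
```

A rewrite like this only touches navigation so the capture remains usable; it leaves the visible content of the page alone, which is the distinction the comment above is drawing.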

  • > Changing the content of archived pages also raises questions about the authenticity of what we're reading.

    This is absolutely the buried lede of this whole saga, and needs to be the focus of conversation in the coming age.

  • Did they actually run the DDoS via a script or was this a case of inserting a link and many users clicked it? They are substantially different IMO

  • As far as I understand the person behind archive.today might face jail time if they are found out. You shouldn't be surprised that people lash out when you threaten their life.

    I don't think the DDOSing is a very good method for fighting back but I can't blame anyone for trying to survive. They are definitely the victim here.

    If that blog really doxxed them out of idle curiosity they are an absolute piece of shit. Though I think this is more of a targeted campaign.

    • One thing they always teach you in Crime University is "don't break two laws at the same time." If you have contraband in your car, don't speed or run red lights, because it brings attention and attention means jail.

      In this case, I didn't know that the archive.today people were doxxed until they started the ddos campaign and caught attention. I doubt anyone in this thread knew or cared about the blogger until he was attacked. And now this entire thing is a matter of permanent record on Wikipedia and in the news. archive.today's attempt at silencing the blogger is only bringing them more trouble, not less.

      Barbara_Streisand_Mansion.jpg

    • > As far as I understand the person behind archive.today might face jail time if they are found out. You shouldn't be surprised that people lash out when you threaten their life.

      One of the really strange things about all of this is that there is a public forum post in which a guy claims to be the site owner. So this whole debacle is this weird mix of people who are angry and saying "clearly the owner doesn't want to be associated with the site" on the one hand, but then on the other hand there's literally a guy who says he's the one that owns the site, so it doesn't seem like that guy is very worried about being associated with it?

      It also seems weird to me that it's viewed as inappropriate to report on the results of Googling the guy who said he owns the site, but maybe I'm just out of touch on that topic.

    • Somebody who a) directs DDOS attacks and b) abuses random visitors' browser for those DDOS attacks is never the victim.

      You don't know their motives for running their site, but you do get a clear message about their character by observing their actions, and you'd do well to listen to that message.

Has anyone else noticed that some of Archive.today's X/Twitter captures [1] are logged in with an account called "advancedhosters" [2], which is associated with a web hosting company apparently located in Cyprus? The latest post [3] from the account links to a blog post [4] including private communications between the webmaster of Archive.today (using their previously-known "Volth" alias) and a site owner requesting a takedown. Also note that the previous post [5] from the "advancedhosters" account was a link to a pro-Russia, anti-Ukraine article, archived via Archive.today of course. Seems like an interesting lead to untangle.

[1] https://archive.today/20240714173022/https://x.com/archiveis...

[2] https://x.com/advancedhosters

[3] https://x.com/advancedhosters/status/1731129170091004412

[4] https://lj.rossia.org/users/mopaiv/257.html

[5] https://x.com/advancedhosters/status/1501971277099286539

I noticed last year that some archived pages are getting altered.

Every Reddit archived page used to have a Reddit username in the top right, but then it disappeared. "Fair enough," I thought. "They want to hide their Reddit username now."

The problem is, they did it retroactively too, removing the username from past captures.

You can see on old Reddit captures where the normal archived page has no username, but when you switch the tab to the Screenshot of the archive it is still there. The screenshot is the original capture and the username has now been removed for the normal webpage version.

When I noticed it, it seemed like such a minor change, but with these latest revelations, it doesn't seem so minor anymore.

  • > When I noticed it, it seemed like such a minor change, but with these latest revelations, it doesn't seem so minor anymore.

    That doesn't seem nefarious, though. It makes sense they wouldn't want to reveal whatever accounts they use to bypass blocks, and the logged-in account isn't really meaningful content to an archive consumer.

    Now, if they were changing the content of a reddit post or comment, that would be an entirely different matter.

It seems a lot of people haven't heard of it, but I think it's worth plugging https://perma.cc/ which is really the appropriate tool for something like Wikipedia to be using to archive pages.

More: https://en.wikipedia.org/wiki/Perma.cc

A bit off topic, but are there any self hosted open source archiving servers people are using for personal usage?

I think ArchiveBox[1] is the most popular. I will give it a shot, but it's a shame they don't support URL rewriting[2], which would be annoying for me. I read a lot of blog and news articles that are split across multiple pages, and it would be nice if that article's "next page" link was a link to the next archived page instead of the original URL.

1: https://archivebox.io/

2: https://github.com/ArchiveBox/ArchiveBox/discussions/1395

https://web.archive.org/web/20260220191245if_/https://arstec...

archive.today is very popular on HN; the opaque, shortened URLs are promoted on HN every day

I can't use archive.today. I tried but gave up. Too many hassles. I might be in the minority but I know I'm not the only one. As it happens, I have not found any site that I cannot access without it

The most important issue with archive.today though is the person running it, their past and present behaviour. It speaks for itself

Whoever it is, they have a lot of info about HN users' reading habits given that archive.today URLs are so heavily promoted by HN submitters, commenters and moderators

  • "archive.today" as used here means the collection of archive.tld domains, where .tld could be ".is", ".md", ".ph", etc.

    "promoted" as used here means placing an archive.tld URL at the top of an HN thread so that many HN readers will follow it, or placing these URLs elsewhere in threads

  • I use archive.today all the time. How do you access pages, like for instance on the economist, without it?

    •    http-request set-header user-agent "Mozilla/5.0 (Linux; Android 14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.103 Mobile Safari/537.36 Lamarr" if { hdr(host) -m end economist.com }
      

      Years ago I used some other workaround that no longer works, maybe something like amp.economist.com. AMP with text-only browser was a useful workaround for many sites

      Workarounds usually don't last forever. Websites change from time to time. This one will stop working at some point

      There are some people who for various reasons cannot use archive.today

    • If dang and tomhow enforced a policy against paywalled content, there would be less interest in accessing those pages via third parties. Most news gets reported by multiple outlets in general, so the same discussions would still surface.

  • you can change the tld of any archive.today link if .today doesn't work. for example archive.ph, archive.is, archive.md, etc

    • There's a DNS issue between Archive Today and some ISPs which causes their domains not to resolve properly, which is why some people have a lot of trouble using it.
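The TLD swap suggested above is easy to script if one domain fails to resolve. A minimal sketch, assuming the mirrors share the same path and short ID (the thread implies this but does not spell it out); the TLD list only contains the ones named in this thread, and `abcde` is a made-up ID:

```python
from urllib.parse import urlsplit, urlunsplit

# TLDs mentioned in this thread; not an exhaustive or authoritative list.
MIRROR_TLDS = ["today", "ph", "is", "md"]


def mirror_urls(url: str) -> list[str]:
    """Return the same archive link on each of the other archive.<tld> domains."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    name, _, tld = host.rpartition(".")
    if name != "archive":
        raise ValueError("not an archive.<tld> link: " + url)
    return [
        urlunsplit(parts._replace(netloc="archive." + other))
        for other in MIRROR_TLDS
        if other != tld
    ]


if __name__ == "__main__":
    for candidate in mirror_urls("https://archive.today/abcde"):
        print(candidate)
```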

  • The fact is I can't have a discussion about a paywalled article without reading it. Archive.today is popular as a paywall bypass because nobody wants HN to devolve into debate based on a headline where nobody has rtfa.

  • > Whoever it is, they have a lot of info about HN users' reading habits given that archive.today URLs are so heavily promoted by HN submitters, commenters and moderators

    It's not promoted, it's just used as a paywall bypass so everyone can read the linked article.

I believe there are multiple options with different degree of "half-baked"-ness, but can anyone name the best self-hosted version of this service?

Ultimately, what we all use it for is pretty straightforward, and it seems like by now we should've arrived at having approximately one best implementation, which could be used both for personal archiving and for internet-facing instances (perhaps even distributed). But I don't know if we have.

Sounds like there's a gap in the market for a "commons" archive... maybe powered by something p2p like BitTorrent protocol?

This would have sounded Very Normal in the 2000s... I wonder if we can go back :)

  • P2P is generally bad for this use case. P2P generally only works for keeping popular content around (content gets dropped when the last peer that cares disconnects). If the content was popular it wouldn't need to be archived in the first place.

Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.

  • > I figured that they have found an (automated) way to imitate Googlebot really well.

    If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
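The "DNS lookup dance" is a reverse lookup followed by a forward confirmation. Below is a minimal sketch of how a site might check it, under a few assumptions: the accepted hostname suffixes are limited to `googlebot.com` and `google.com`, only IPv4 is handled, and the resolver functions are injectable (a choice made here so the logic can be exercised without network access).

```python
import socket
from typing import Callable, List

# Assumed set of hostname suffixes for Google's crawlers (illustrative, not exhaustive).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """A lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if reverse and forward DNS agree on a Google hostname."""
    try:
        host = reverse(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward lookup must lead back to the connecting IP;
        # this is what defeats a spoofed PTR record.
        return ip in forward(host)
    except OSError:
        return False
```

Anyone can set the User-Agent header, and anyone who controls reverse DNS for their own address block can publish a PTR record that claims to be Googlebot. The forward lookup is the step they cannot fake, which is why the check depends on the request's source IP.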

    • There are ways to work around this. I've just tested this: I've used the URL inspection tool of Google Search Console to fetch a URL from my website, which I've configured to redirect to a paywalled news article. Turns out the crawler follows that redirect and gives me the full source code of the redirected web site, without any paywall.

      That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along the lines of this. It's a perfect imitation of Googlebot because it is literally Googlebot.

  • I imagine accounts are the only way that archive.today works on sites like 404media.co that seem to have server-side paywalls. Similarly, Twitter has a completely server-side paywall.

  • > I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.

    The curious part is that they allow web scraping arbitrary pages on demand. So a publisher could put in a lot of arbitrary requests to archive their own pages and see whether they all come from a single account or a small subset of accounts.

    I hope they haven't been stealing cookies from actual users through a botnet or something.

    • Exactly. If I was an admin of a popular news website I would try to archive some articles and look at the access logs in the backend. This cannot be too hard to figure out.

    • You don't even need active measures. If a publisher is serious about tracing traitors there are algorithms for that (which are used by streamers to trace pirates). It's called "Traitor Tracing" in the literature. The idea is to embed watermarks following a specific pattern that would point to a traitor or even a coalition of traitors acting in concert.

      It would be challenging to do with text, but is certainly doable with images - and articles contain those.
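As a toy illustration of the watermarking idea: give every account its own bit pattern, serve one of two visually equivalent variants (say, of each image) per bit, and read the pattern back from a leaked copy. Everything below is a hypothetical sketch; real traitor-tracing codes are built to survive collusion between several accounts, which this simple hash-based codeword does not attempt.

```python
import hashlib
from typing import Iterable, List


def codeword(account_id: str, n_bits: int = 16) -> List[int]:
    """Derive a per-account bit pattern (at most 256 bits) from the account ID."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(n_bits)]


def variants_to_serve(account_id: str, n_slots: int = 16) -> List[int]:
    """For each watermark slot (e.g. an image), pick variant 0 or variant 1."""
    return codeword(account_id, n_slots)


def trace(leaked_bits: List[int], accounts: Iterable[str]) -> List[str]:
    """Return the accounts whose codeword matches the bits read from a leaked copy."""
    return [a for a in accounts if codeword(a, len(leaked_bits)) == leaked_bits]


if __name__ == "__main__":
    subscribers = ["alice", "bob", "carol"]  # made-up account IDs
    leaked = variants_to_serve("bob")        # pattern recovered from an archived page
    print(trace(leaked, subscribers))
```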

  • > which is, of course, ridiculous.

    Why? In the world of web scraping this is pretty common.

    • Because it works too reliably. Imagine what that would entail. Managing thousands of accounts. You would need to ensure that you strip the account details from archived pages perfectly. Every time the website changes its code even slightly you are at risk of losing one of your accounts. It would constantly break and would be an absolute nightmare to maintain. I've personally never encountered such a failure on a paywalled news article. archive.today managed to give me a non-paywalled clean version every single time.

      Maybe they use accounts for some special sites. But there is definitely some automated generic magic happening that manages to bypass paywalls of news outlets. Probably something Googlebot related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.

  • I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.

  • It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.

The FBI called out archive.today a couple months ago. There's clearly a campaign against them by the USA (4th Reich), which stands principally against any information repository they don't control or have influence over (it's Russian-owned). This is simply donors of the Trump regime who own media companies requesting this because it's the primary way around paywalls for most people who know about it.

There is an enormous amount of stuff that is only on archive.today, including stuff that is otherwise gone forever. A mix of stuff that somebody only ever did archive.today on and not archive.org, and stuff that could only be archived on archive.today because archive.org fails on it.

Anything on twitter post-login-wall for one. A million only-semi-paywalled news articles for others. But mainly an unfathomably long tail.

It was extremely distressing when the admin started(?) behaving badly for this reason. That others are starting to react this way to it is understandable. What a stupid tragedy.

I noticed I've started being redirected to a blank nginx server for archive.is... but only the .is domain, .ph and .today work just fine. I wonder if they ended up on an adblocker or two.

  • There was some beef the site owner had with Cloudflare where if you were using Cloudflare DNS it wouldn’t serve anything to you? Is that still happening?

    Not sure why it would only be on archive.is and not the others but ‘is’ loads for me.

> If you want to pretend this never happened – delete your old article and post the new one you have promised. And I will not write “an OSINT investigation” on your Nazi grandfather

From hero to a Kremlin troll in five seconds.

> “I’m glad the Wikipedia community has come to a clear consensus, and I hope this inspires the Wikimedia Foundation to look into creating its own archival service,” he told us.

Hardly possible for Wikimedia to provide a service like archive.today given the legal trouble of the latter.

Strangely naive.

> an analysis of existing links has shown that most of its uses can be replaced.

Oh? Do tell!

  • I would be surprised if archive.today had something that was not in the Wayback Machine

    • Archive.today has just about everything the archived site doesn't want archived. Archive.org doesn't, because it lets sites delete archives.

    • I know that sometimes the behavior of each archiver service is a bit different. For example, it's possible that both Archive.today and the Internet Archive say they have a copy of a page, but then when you open up the IA version, you might see that it renders completely differently or not at all. It might happen because the webpage has like two scrollbars, or maybe there's a redirect that happens when a link to the page is loaded. I notice this seems to happen on documentation pages that are hosted by Salesforce. It can be a bit of a pain if you want to save a backup copy online of a release note or something like that for everyone to easily reference in the future.

    • Wayback machine removes archives upon request, so there’s definitely stuff they don’t make publicly available (they may still have it).

    • Trying to search the Wayback machine almost always gives me their made-up 498 error, and when I do get a result the interface for scrolling through dates is janky at best.

  • >> an analysis of existing links has shown that most of its uses can be replaced.

    >Oh? Do tell!

    They do. In the very next paragraph in fact:

       The guidance says editors can remove Archive.today links when the original 
       source is still online and has identical content; replace the archive link so 
       it points to a different archive site, like the Internet Archive, 
       Ghostarchive, or Megalodon; or “change the original source to something that 
       doesn’t need an archive (e.g., a source that was printed on paper)

So toward the end of last year, the FBI was after archive.today, presumably either for keeping track of things the current administration doesn't want tracked, or maybe for the paywall thing (on behalf of rich donors/IP owners). https://gizmodo.com/the-fbi-is-trying-to-unmask-the-registra...

That effort appears to have gone nowhere, so now suddenly archive.today commits reputational suicide? I don't suppose someone could look deeper into this please?

  • The archive.today operator claims on his blog that this was nothing major: https://lj.rossia.org/users/archive_today/

    > Regarding the FBI’s request, my understanding is that they were seeking some form of offline action from us — anything from a witness statement (“Yes, this page was saved at such-and-such a time, and no one has accessed or modified it since”) to operational work involving a specific group of users. These users are not necessarily associates of Epstein; among our users who are particularly wary of the FBI, there are also less frequently mentioned groups, such as environmental activists or right-to-repair advocates.

    > Since no one was physically present in the United States at that time, however, the matter did not progress further.

    > You already know who turned this request into a full-blown panic about “the FBI accusing the archive and preparing to confiscate everything.”

    Not sure who he's talking about there.

FYI, archive.today is NOT the Internet Archive/Wayback Machine.

  • I prefer archive.today because the Internet Archive’s Wayback Machine allows retrospective removals of archived pages. If a URL has already been crawled and archived, the site owner can later add that URL to robots.txt and request a re-crawl. Once the crawler detects the updated robots.txt, previously stored snapshots of that page can become inaccessible, even if they were captured before the rule was added.

    Unfortunately this happens more often than one would expect.

    I found this out when I preserved my very first homepage I made as a child on a free hosting service. I archived it on archive.org, and thought it would stay there forever. Then, in 2017 the free host changed the robots.txt, closed all services, and my treasured memory was forever gone from the internet. ;(
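To make the robots.txt mechanism concrete, here is a small sketch using Python's standard `urllib.robotparser`. It only shows how a crawler that honours robots.txt would read the file before and after the host's change; it is not the Internet Archive's actual code, and the `ia_archiver` user-agent name, the host, and the URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if this robots.txt tells the given crawler to stay away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# What a free host might have served while the site was alive (hypothetical).
BEFORE = "User-agent: *\nDisallow:\n"
# What it serves after shutting down: everything is off limits.
AFTER = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    page = "https://freehost.example/home/index.html"
    print(is_blocked(BEFORE, page))
    print(is_blocked(AFTER, page))
```

If the archive applies the current robots.txt to snapshots it already holds, the second result is what hides captures made years before the rule existed.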

>In emails sent to Patokallio after the DDoS began, “Nora” from Archive.today threatened to create a public association between Patokallio’s name and AI porn and to create a gay dating app with Patokallio’s name.

Oh good. That's definitely a reasonable thing to do or think.

The raw sociopathy of some people. Getting doxxed isn't good, but this response is unhinged.

  • It's a reminder how fragile and tenuous are the connections between our browser/client outlays, our societal perceptions of online norms, and our laws.

    We live at a moment where it's trivially easy to frame possession of an unsavory (or even illegal) number on another person's storage media, without that person even realizing (and possibly, with some WebRTC craftiness and social engineering, even get them to pass on the taboo payload to others).

  • I mean, the admin of archive.today might face jail time if deanonymised, kind of understandable he's nervous. Meanwhile for Patokallio it's just curiosity and clicks

  • That was private negotiations, btw, not public statements.

    They were in response to J.P.'s blog, which had already framed AT as a project grown from a carding forum, and to him pushing his speculations onto Ars Technica, whose parent company just destroyed 12ft and is on to a new victim. The story is full of untold conflicts of interest, covered up with a soap opera around the DDoS.

[flagged]

  • > i don't know anything specific about the site or any conflicts involved, yet this smells like a negative PR campaign to me...

    What possible value could a comment from someone who has no knowledge of the site or conflict add to this discussion?

Anecdotally I generally see archive.is/archive.today links floating around "stochastic terrorist" sites and other hate cults.

They seem totally unrelated to the Internet Archive. They probably only ever got on Wikipedia by leeching off the IA brand and confusing enough people into using them

  • The Wayback Machine won't bypass paywalls or pirate content, not to mention they are under US jurisdiction. You can't have your cake and eat it.

    • Honestly, IMHO archive.today is just so much nicer to use in every aspect than IA, that unless they outright start to distribute malware (I mean, like, via the page itself — otherwise it's pretty much irrelevant), I don't think I'll stop using it.

At this point Archive.today provides a better service (all things considered) compared to Wikipedia, at least when it comes to current affairs.

Why not show both? Wikipedia could display archive links alongside original sources, clearly labeled so readers know which is which. This preserves access when originals disappear while keeping the primary source as the main reference.

  • Wikipedia shouldn't allow links to sites which intentionally falsify archived pages and use their visitors to perform DDOS attacks.

  • They generally do. Random example, citation 349 on the page of George Washington: ""A Brief History of GW"[link]. GW Libraries. Archived[link] from the original on September 14, 2019. Retrieved August 19, 2019."

I will no longer donate to Wikipedia as long as this is policy.