It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.
I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.
The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5 year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.
Pick the items you want to mirror and seed them via their torrent file.
https://news.ycombinator.com/item?id=45559219
(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)
There are real problems with the torrent files for collections. They are automatically created when a collection is first created and uploaded, so they only include the files of the initial upload. For very large collections (100+ GB) it is common for a creator to add/upload files into a collection in batches, but the torrent file is never regenerated, so downloading via the torrent yields only a small subset of the entire collection.
https://www.reddit.com/r/torrents/comments/vc0v08/question_a...
The solution is to use one of the several IA downloader scripts on GitHub, which download content via the collection's file list. I don't like direct downloading since I know that costs IA the most, but torrents really aren't an option for some collections.
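For what it's worth, those downloader scripts mostly lean on IA's documented metadata endpoint (`https://archive.org/metadata/<identifier>`), which returns the item's current file list as JSON. A minimal sketch of that approach (the identifier and file names below are made up for illustration):

```python
import json
import urllib.request

def build_download_urls(identifier: str, metadata: dict) -> list[str]:
    """Turn a metadata response into direct download URLs for every file,
    including files added after the item's torrent was generated."""
    return [
        f"https://archive.org/download/{identifier}/{f['name']}"
        for f in metadata.get("files", [])
    ]

def fetch_metadata(identifier: str) -> dict:
    """Fetch the item's live metadata record (requires network access)."""
    with urllib.request.urlopen(f"https://archive.org/metadata/{identifier}") as resp:
        return json.load(resp)

# Offline demonstration with a fabricated metadata record:
fake_meta = {"files": [{"name": "disc1.iso"}, {"name": "disc2.iso"}]}
urls = build_download_urls("example-collection", fake_meta)
```

In practice you'd pass `fetch_metadata(identifier)` into `build_download_urls` and feed the resulting URLs to whatever rate-limited downloader you prefer.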
Turns out, there are a lot of 500 GB-2 TB collections of ROMs/ISOs for video game consoles through the 7th and 8th generations available on the IA...
5 replies →
It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you) but those can be solved with tunnels, the real value is the redundant storage.
The illegal side of hosting, sharing, and mirroring technology, as it were, is much more free to chase technical excellence at all costs.
There are lessons to be learned in that. For example, for that population, bandwidth efficiency and information leakage control invite solutions that are suboptimal for an organization that would build market share on licensing deals and growth maximization.
Without an overriding commercial growth directive you also align development incentives differently.
I was hopeful a few years ago when I heard of chia coin, that it would allow distributed internet storage for a price.
Users upload their encrypted data to miners, along with a negotiated fee for a duration of storage, say 90 days. They take specific hashes of the complete data, plus some randomized sub-hashes of internal chunks. Periodically an agent requests these chunks, hashes them, and releases a fraction of the payment if the hash is correct.
That's a basic sketch, more details would have to be settled. But "miners" would be free to delete data if payment was no longer available on a chain. Or additionally, they could be paid by downloaders instead of uploaders for hoarding more obscure chunks that aren't widely available.
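The spot-check part of that sketch is easy to prototype; everything here (chunk size, audit policy, payment logic) is an illustrative assumption, not how Chia or any real storage network works:

```python
import hashlib
import os
import random

CHUNK_SIZE = 64 * 1024  # 64 KiB; an arbitrary choice for this sketch

def chunk_hashes(data: bytes) -> list[str]:
    """Uploader hashes each fixed-size chunk and keeps this manifest."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def audit(stored: bytes, manifest: list[str], n: int = 3) -> bool:
    """Auditor spot-checks n random chunks; pay out only if all match."""
    for i in random.sample(range(len(manifest)), k=min(n, len(manifest))):
        chunk = stored[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        if hashlib.sha256(chunk).hexdigest() != manifest[i]:
            return False  # chunk lost or corrupted: withhold payment
    return True

blob = os.urandom(4 * CHUNK_SIZE)  # stands in for the uploader's encrypted data
manifest = chunk_hashes(blob)      # retained by the uploader / posted on-chain
honest = audit(blob, manifest)     # an honest miner passes the spot-check
```

Real schemes (proofs of retrievability) avoid the auditor needing the chunks at all, but the incentive shape is the same: random audits make silently dropping data unprofitable.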
> a bunch of pervs
Not everyone who watches hentai is a perv
3 replies →
The fact that AI companies are strip-mining IA for content and not helping to be part of the solution is egregious.
How is it "egregious" that people are obtaining content to use for their own purposes from a resource intentionally established as a repository of content for people to obtain and use for their own purposes?
5 replies →
Might be easier for them to just pay for the mirrors and do an on-site copy and move the data in a container?
That way they would provide some more value back to the community as a mirror?
Has any evidence been provided for this fact?
3 replies →
I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.
[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get a HTTP 5xx error or "connection refused."
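For what it's worth, the Wayback Machine does expose a public CDX Server API for enumerating captures, and archived bytes can be fetched with the `id_` URL modifier; whether it holds up under bulk use (vs. the 5xx errors above) is another matter. A sketch that only builds the request URLs:

```python
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def list_captures_query(url: str, limit: int = 10) -> str:
    """Build a CDX query returning JSON rows of capture metadata
    (timestamp, original URL, status code, digest, ...)."""
    return f"{CDX}?{urlencode({'url': url, 'output': 'json', 'limit': limit})}"

def raw_snapshot_url(timestamp: str, url: str) -> str:
    """The 'id_' modifier requests the archived bytes without the
    Wayback toolbar and link-rewriting chrome."""
    return f"https://web.archive.org/web/{timestamp}id_/{url}"

q = list_captures_query("example.com", limit=5)
raw = raw_snapshot_url("20090101000000", "http://example.com/")
```

Fetching those URLs with any HTTP client gives you the capture index first, then each snapshot; polite rate limiting is on you.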
https://github.com/internetarchive/wayback/tree/master/wayba...
https://akamhy.github.io/waybackpy/
https://wiki.archiveteam.org/index.php/Restoring
2 replies →
I wish there were some kind of file search for the Wayback Machine. Like "list all .S3M files on members.aol.com before 1998". It would've made looking for obscure nostalgia much easier.
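The CDX API can actually get close to this, if the captures exist: `matchType=prefix` plus a regex `filter` and a `to=` date bound. A hedged sketch of the query construction (I haven't verified that any such captures survive):

```python
from urllib.parse import urlencode

# Approximating: "list all .S3M files on members.aol.com before 1998"
params = urlencode({
    "url": "members.aol.com",
    "matchType": "prefix",                 # every capture under this prefix
    "filter": r"original:.*\.[sS]3[mM]$",  # regex on the original-URL field
    "to": "1997",                          # captures no later than end of 1997
    "collapse": "urlkey",                  # one row per unique URL
    "output": "json",
})
query = f"https://web.archive.org/cdx/search/cdx?{params}"
```

The catch is that this only searches URLs the crawler actually captured, not a full-text or filesystem index, so obscure files it never fetched remain invisible.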
Is running an IPFS node and pinning the internet archive's collections a good way to do this?
> $25-30M per year is a lot for a non-profit
$25 million a year is not remotely a lot for a non-profit doing any kind of work at scale. Wikimedia's budget is about seven times that. My local Goodwill chapter has an annual budget greater than that.
You have an extremely skewed view of the average nonprofit
You're being purposefully obtuse. Most non-profits don't function at scale (nor do they work best at scale). They serve their local community.
Facilitating mirroring seems like it would open up another can of liability worms for the IA, as well as, potentially, for those mirroring it. For example, they recently lost an appeal of a major lawsuit brought by book publishers. And then there's the Wayback Machine itself; who knows what they've hoovered up from the public internet over the years? Would you be comfortable mirroring that?
i dunno why i keep imagining something like ipfs could help something like the internet archive...
> $25-30M per year is a lot for a non-profit
First, whether IA or any other large non-profit/charity: when you are in the double- or triple-digit multi-million bracket, you are no longer a non-profit/charity. You are in effect a business with non-profit status.
Whether IA or any other large entity, when you get to that size, you don't benefit from the "oh they are a poor non-profit" mindset IMHO.
To be able to spend $25-30M a year, you clearly have to have a solid revenue stream both immediate and in the pipeline, that's Finances 101. Therefore you are in a privileged and enviable position that small non-profits can only dream of.
Second, I would be curious to know how much of that is of their own doing.
By that I mean, it's sure cute to be located in the former Christian Science church on Funston Avenue in San Francisco's Richmond District.
But they could most likely save a lot of money if they were located in a carrier-neutral facility.
For example, instead of paying for expensive external fiber lines (no doubt multiple, due to redundancy), they would have large amounts of capacity available through simple cross-connects.
Similarly with energy: are they benefiting from the same economies of scale that a carrier-neutral facility does?
I am not saying the way they are doing it is wrong. I'm just genuinely curious to know what premium they are paying for doing it like they are.
Probably the advantages of its location outweigh the extra costs for the IA. Having your datacentre sited on land and in a building you own, behind a non-shared front door, has legal advantages similar to the ones which drive organisations to keep their data centres on-premises. A distinctive location in a nice area of San Francisco probably helps to keep cultivating the goodwill of the SV tech industry and of local and state politicians. It's also an advantage to be within easy walking distance in a neighbourhood where people like the IA and would be inclined to go there and protest if government forces rolled up and started pushing their way inside. To be sure, I presume that 300 Funston Ave. also being a very pleasant workplace for senior IA people has something to do with why the Archive moved there and remains there; but remaining there seems justifiable for other reasons.
This seems like a lot of zesty made-up assumptions.
And a lot of non-profits would be very very surprised to hear that once you cross the threshold of $9,999,999 costs, you are a business.
3 replies →
I'd like a Public Broadcasting Service for the Internet but I'm afraid that money would just be pulled from actual PBS at this point to support it.
Too late, PBS is already defunded. CPB was deleted. PBS is now an indie organization without a dime of public money. They should probably rebrand and lose the word “Public”
Don’t put any stock into the numbers in the article. They are mostly made up out of thin air.
They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.
There is a fundamental resistance to tape technology that exists to this day as a result of all those troubles.
That's sad, but it mirrors my experience with commercial customers. Tape is so fiddly, but the cost efficiency for large amounts of data and the at-rest stability are so good. Tape is caught in a spiral of decreasing market share, so the industry has no incentive to optimize it.
Edit: Then again, I recently heard a podcast that talked about the relatively good at-rest stability of SATA hard disk drives stored outdoors. >smile<
14 replies →
We had a little server room where the AC was mounted directly over the rack. I don't think we ever put an umbrella in there but it sure made everyone nervous the drain pipe would clog.
Much more recently, I worked at a medium-large SaaS company but if you listened to my coworkers you'd think we were Google (there is a point where optimism starts being delusion, and a couple of my coworkers were past it.)
Then one day I found the telemetry pages for Wikipedia. I am hoping some of those charts were per hour not per second, otherwise they are dealing with mind numbing amounts of traffic.
Is this some kind of copypasted AI output? There are unformatted footnote numbers at the end of many sentences.
I was thinking the same thing. No proofreading is a sure sign to me. I also feel like I've read parts of this before.
Some of the images are AI generated (see the Gemini watermark in the bottom right), and the final paragraph also reads extremely AI-generated.
The table also seems like the kind of thing that Gemini seems to generate a lot. "Here's a table that communicates almost no information! One of the rows is constant for each item."
Maybe, but I was trying to find the original source of this article and couldn’t, at least not cursorily.
I already stopped when I saw the AI-gen image
I think this was written wholly by deep research.
It just reads like a clunky low quality article
It's clearly AI writing ("hum", "delve") but oddly I don't think deep research models use those words.
I think relying on the vocabulary to indicate AI is pointless (unless they're actually using words that AI made up). There's a reason they use words such as those you've pointed out: because they're words, and their training material (a.k.a. output by humans) use them.
6 replies →
I love to imagine this is all a cover and the Internet Archive is located in a remote cave in northern Sweden and consists of a series of endlessly self replicating flash drives powered by the sun.
This article is way too LLMey for my taste.
IA is hosting a couple more of Rick Prelinger’s shows this month. Looking forward to visiting
Does IA do deduplication?
Not in the way I think you're talking about. The archive has always tried to maintain a situation where the racks could be pushed out of the door or picked up after being somewhere, and the individual drives will contain complete versions of the items. We have definitely reached out to people who seem to be doing redundant work and asked them to stop, or for permission to remove the redundant item. But that's a pretty curatorial process.
[flagged]
Here's the second paragraph in full:
"Here, amidst the repurposed neoclassical columns and wooden pews of a building constructed to worship a different kind of permanence, lies the physical manifestation of the "virtual" world. We tend to think of the internet as an ethereal cloud, a place without geography or mass. But in this building, the internet has weight. It has heat. It requires electricity, maintenance, and a constant battle against the second law of thermodynamics. As of late 2025, this machine—collectively known as the Wayback Machine—has archived over one trillion web pages.1 It holds 99 petabytes of unique data, a number that expands to over 212 petabytes when accounting for backups and redundancy.3"
can you help my small brain by pointing out where in this paragraph they talk about deduplication?
I don't think the article mentions anything about deduplication. Can you be less snarky and actually quote the relevant sentence?
Wow that piece of real-estate has to cost a bundle.
Thanks for this, I've always wondered how the Archive operates but always ended up not searching.
Is it still year 2006 and websites haven’t figured out responsive design?
Does any one know how the size of this compares to archive.today?
We absolutely lap them with many, many more petabytes of material. But archive.today is also not doing speculative or multiple scheduled captures of the number of sites that archive.org is.
How long will it take for them to send the PetaBox to space?
That project gets discussed every once in a while.
The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.
They can offer a perk that literally no other tech job can offer: Someday have a statue of your likeness preserved in ceramic: https://www.atlasobscura.com/places/internet-archive-headqua...
"Inside the church's main room, with its still-intact pews, there are more than 120 ceramic sculptures of the Internet Archive's current and former employees, created by artist Nuala Creed and inspired by the statues of the Xian warriors in China."
We've hired a few dozen people over the past couple of years. We think they're pretty talented.
Is retrieval from the Wayback Machine intentionally made slow?
5 replies →
I have always wondered how archives manage to capture screenshots of paywalled pages like the New York Times or the Wall Street Journal. Do they have agreements with publishers, do their crawlers have special privileges to bypass detection, or do they use technology so advanced that companies cannot detect them?
Disappointed with the lack of pictures.
Probably because this looks more like a Deep Research agent "delving" into the infrastructure -- with a giant list of sources at the end. The Archive is not just a library; it is a service provider.
I wasn't expecting to read a podcast when clicking.
What do you want some pictures of?
From an article about "infrastructure" that opens with a dramatic description of a datacenter stuffed into an old church, I would expect more than just the generic clipart you'd see in the back half of Wired magazine.
8 replies →
Hate to be the guy in the comments complaining about the CSS, but the sides of the text of this article are cut off. It looks like I'm zoomed in, and there's no way to see the first few columns of the text without going to Reader view. I'm on a modern iPhone using Safari, with accessibility font settings larger than usual.
Same for me, Safari iOS 18.7.1 no accessibility font size set, no browsers font size set.
FWIW, it's the same for me on FF Android.
It's an AI-generated article. It's going to be pretty terrible.
this is every data hoarder's dream setup haha
[flagged]
[flagged]
Was this reply meant for this story instead? https://news.ycombinator.com/item?id=46637127
>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.
I'd say the nonprofit has found itself a profitable reason for its existence