Comment by hedora

25 days ago

It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.

I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.

The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5-year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.


toomuchtodo 25 days ago

Pick the items you want to mirror and seed them via their torrent file.

https://news.ycombinator.com/item?id=45559219

(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)

billyhoffman 25 days ago
There are real problems with the torrent files for collections. They are created automatically when a collection is first created and uploaded, so they only include the files from the initial upload. For very large collections (100+ GB) it is common for a creator to upload files in batches, but the torrent file is never regenerated, so downloading via the torrent yields only a small subset of the entire collection.
https://www.reddit.com/r/torrents/comments/vc0v08/question_a...
The solution is to use one of the several IA downloader scripts on GitHub, which download content via the collection's file list. I don't like downloading directly since I know that costs IA more, but torrents really aren't an option for some collections.
Turns out, there are a lot of 500GB-2TB collections of ROMs/ISOs for video game consoles through the 7th and 8th generation available on the IA...
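A minimal sketch of the file-list approach those downloader scripts take, not any particular script's code. It assumes the archive.org metadata endpoint (`https://archive.org/metadata/<identifier>`) returns JSON with a `files` list whose entries carry a `name`, and that files are served under `https://archive.org/download/<identifier>/<name>`. The identifier `example-item` is hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_metadata(identifier):
    """Fetch an item's metadata JSON (assumed endpoint, see lead-in)."""
    with urlopen(f"https://archive.org/metadata/{quote(identifier)}") as resp:
        return json.load(resp)


def file_urls(identifier, metadata):
    """Build one direct-download URL per entry in the item's file list."""
    base = f"https://archive.org/download/{quote(identifier)}"
    # quote() leaves "/" alone, so names inside subdirectories stay intact.
    return [f"{base}/{quote(entry['name'])}" for entry in metadata.get("files", [])]


if __name__ == "__main__":
    # Offline demo with a hand-written metadata dict (hypothetical item).
    sample = {"files": [{"name": "disc 1.iso"}, {"name": "docs/readme.txt"}]}
    for url in file_urls("example-item", sample):
        print(url)
```

Because the list comes from the current metadata rather than the original torrent, it should include files added in later batches. Throttle the actual downloads to be kind to IA's infrastructure.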
- Wowfunhappy 25 days ago
  
  Is this something the Internet Archive could fix? I would have expected the torrent to get replaced when an upload is changed, maybe with some kind of 24 hour debounce.
  

nodja 25 days ago

It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you) but those can be solved with tunnels, the real value is the redundant storage.

vetrom 25 days ago

The illegal side of hosting, sharing, and mirroring technology, as it were, is much more free to chase technical excellence at all costs.
There are lessons to be learned in that. For example, for that population, bandwidth efficiency and information leakage control invite solutions that are suboptimal for an organization that would build market share on licensing deals and growth maximization.
Without an overriding commercial growth directive you also align development incentives differently.
gosub100 24 days ago

I was hopeful a few years ago when I heard of chia coin, that it would allow distributed internet storage for a price.
Users upload their encrypted data to miners, along with a negotiated fee for a duration of storage, say 90d. They take specific hashes of the complete data, and some randomized sub-hashes of internal chunks. Periodically an agent requests these chunks, hashes them, and rewards a fraction of the payment if the hash is correct.
That's a basic sketch, more details would have to be settled. But "miners" would be free to delete data if payment was no longer available on a chain. Or additionally, they could be paid by downloaders instead of uploaders for hoarding more obscure chunks that aren't widely available.
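The challenge-and-reward loop described above can be sketched roughly as follows. This illustrates only the idea in this comment, not how Chia or any real storage network works. All names (`make_commitment`, `audit`, `fetch_chunk`) are hypothetical, and encryption, payment, and on-chain settlement are left out.

```python
import hashlib
import secrets


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def split_chunks(data, chunk_size):
    """Cut the payload into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def make_commitment(data, chunk_size):
    """Uploader side: record hashes before handing the data to a miner."""
    return {
        "full_hash": sha256_hex(data),
        "chunk_hashes": [sha256_hex(c) for c in split_chunks(data, chunk_size)],
    }


def audit(commitment, fetch_chunk, reward, pick_index=secrets.randbelow):
    """Auditor side: request one random chunk and pay only if its hash matches.

    fetch_chunk(index) stands in for asking the miner for that chunk; it
    returns bytes, or None if the miner no longer has the data.
    """
    index = pick_index(len(commitment["chunk_hashes"]))
    chunk = fetch_chunk(index)
    if chunk is None:
        return 0
    return reward if sha256_hex(chunk) == commitment["chunk_hashes"][index] else 0


if __name__ == "__main__":
    payload = b"0123456789abcdef"
    commitment = make_commitment(payload, chunk_size=4)
    stored = split_chunks(payload, 4)
    print(audit(commitment, lambda i: stored[i], reward=5))
```

The miner cannot predict which chunk will be requested, so holding only part of the data fails some audits over time.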
fuzzer371 25 days ago
> a bunch of pervs
Not everyone who watches hentai is a perv
- account42 24 days ago
  
  Yeah sure, they are just into 9000 year old dragons...
- f33d5173 25 days ago
  
  Just don't look up what the word "hentai" means ;)
  

qingcharles 25 days ago

The fact AI companies are strip-mining IA for content and not helping to be part of the solution is egregious.

Gormo 25 days ago
How is it "egregious" that people are obtaining content to use for their own purposes from a resource intentionally established as a repository of content for people to obtain and use for their own purposes?
- Intralexical 25 days ago
  
  Because nobody who opens a public library does so intending, nor consenting, for random companies to jam the entrance trying to cart off thousands of books solely to use for their own enrichment.
  https://xkcd.com/1499/
  
- decremental 25 days ago
  
  [dead]
sailfast 25 days ago

Might be easier for them to just pay for the mirrors and do an on-site copy and move the data in a container?
That way they would provide some more value back to the community as a mirror?
astrange 25 days ago
Has any evidence been provided for this fact?
- textfiles 25 days ago
  
  They absolutely are.
  
- stonogo 25 days ago
  
  Yes.

philipkglass 25 days ago

I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.

[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get an HTTP 5xx error or "connection refused."

toomuchtodo 25 days ago

https://github.com/internetarchive/wayback/tree/master/wayba...

https://akamhy.github.io/waybackpy/

https://wiki.archiveteam.org/index.php/Restoring

philipkglass 25 days ago

Yes, there are documents and third party projects indicating that it has a free public API, but I haven't been able to get it to work. I presume that a paid API would have better availability and the possibility of support.

I just tried waybackpy and I'm getting errors with it too when I try to reproduce their basic demo operation:

  >>> from waybackpy import WaybackMachineSaveAPI
  >>> url = "https://nuclearweaponarchive.org"
  >>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
  >>> save_api = WaybackMachineSaveAPI(url, user_agent)
  >>> save_api.save()
  Traceback (most recent call last):
    File "<python-input-4>", line 1, in <module>
      save_api.save()
      ~~~~~~~~~~~~~^^
    File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 210, in save
      self.get_save_request_headers()
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
    File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 99, in get_save_request_headers
      raise TooManyRequestsError(
      ...<4 lines>...
      )
  waybackpy.exceptions.TooManyRequestsError: Can not save 'https://nuclearweaponarchive.org'. Save request refused by the server. Save Page Now limits saving 15 URLs per minutes. Try waiting for 5 minutes and then try again.
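The error message itself suggests waiting five minutes and retrying. A minimal sketch of that, assuming the refusal is a transient rate limit (it may not be). The helper `call_with_wait` is hypothetical and not part of waybackpy.

```python
import time


def call_with_wait(action, retry_on, attempts=3, wait_seconds=300, sleep=time.sleep):
    """Call action(); if it raises retry_on, wait and try again.

    Re-raises the last exception once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(wait_seconds)  # 300 s matches the "5 minutes" in the message
```

With the session above this would be called as `call_with_wait(save_api.save, TooManyRequestsError)`, importing `TooManyRequestsError` from `waybackpy.exceptions` as named in the traceback.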


986aignan 25 days ago

I wish there were some kind of file search for the Wayback Machine. Like "list all .S3M files on members.aol.com before 1998". It would've made looking for obscure nostalgia much easier.
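The closest existing thing may be the Wayback Machine's CDX server API, which can filter captures by URL prefix, date range, and a regex on the original URL. A rough sketch that only builds the query URL. The endpoint and parameter names (`matchType`, `to`, `filter`, `collapse`, `fl`) are as I understand that API, and whether its index covers this host that far back is an assumption.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(site_prefix, extension, last_year, limit=50):
    """Build a CDX query for captures under site_prefix ending in .extension."""
    params = {
        "url": site_prefix,
        "matchType": "prefix",             # everything under this URL prefix
        "to": str(last_year),              # captures up to and including this year
        "filter": rf"original:.*\.{extension}$",
        "collapse": "urlkey",              # one row per distinct URL
        "fl": "timestamp,original",
        "output": "json",
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(cdx_query_url("members.aol.com/", "s3m", 1997))
```

The regex may be case-sensitive, so `.S3M` and `.s3m` might need separate queries.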

quux 25 days ago

Is running an IPFS node and pinning the internet archive's collections a good way to do this?

Gormo 25 days ago

> $25-30M per year is a lot for a non-profit

$25 million a year is not remotely a lot for a non-profit doing any kind of work at scale. Wikimedia's budget is about seven times that. My local Goodwill chapter has an annual budget greater than that.

esseph 25 days ago

You have an extremely skewed view of the average nonprofit
Medium_Taco 25 days ago

You're being purposefully obtuse. Most non-profits don't function at scale (neither do they do best at scale). They serve their local community

anonnon 24 days ago

Facilitating mirroring seems like it would open up another can of liability worms for the IA, as well as, potentially, for those mirroring it. For example, they recently lost an appeal of a major lawsuit brought by book publishers. And then there's the Wayback Machine itself; who knows what they've hoovered up from the public internet over the years? Would you be comfortable mirroring that?

traceroute66 24 days ago

> $25-30M per year is a lot for a non-profit

First, whether it's IA or any other large non-profit/charity: when you are in the double-digit/triple-digit multi-million bracket, you are no longer a non-profit/charity. You are in effect a business with non-profit status.

Whether IA or any other large entity, when you get to that size, you don't benefit from the "oh they are a poor non-profit" mindset IMHO.

To be able to spend $25-30M a year, you clearly have to have a solid revenue stream both immediate and in the pipeline, that's Finances 101. Therefore you are in a privileged and enviable position that small non-profits can only dream of.

Second, I would be curious to know how much of that is of their own doing.

By that I mean, it's sure cute to be located in the former Christian Science church on Funston Avenue in San Francisco’s Richmond District.

But they could most likely save a lot of money if they were located in a carrier-neutral facility.

For example, instead of paying for expensive external fiber lines (no doubt multiple, due to redundancy), they would have large amounts of capacity available through simple cross-connects.

Similarly on energy. Are they benefiting from the same economies of scale that a carrier-neutral facility does?

I am not saying the way they are doing it is wrong. I'm just genuinely curious to know what premium they are paying for doing it like they are.

leoc 24 days ago

Probably the advantages of its location outweigh the extra costs for the IA. Having your datacentre sited on land and in a building you own, behind a non-shared front door, has legal advantages similar to the ones which drive organisations to keep their data centres on-premises. A distinctive location in a nice area of San Francisco probably helps to keep cultivating the goodwill of the SV tech industry and of local and state politicians. It's also an advantage to be within easy walking distance in a neighbourhood where people like the IA and would be inclined to go there and protest if government forces rolled up and started pushing their way inside. To be sure, I presume that 300 Funston Ave. also being a very pleasant workplace for senior IA people has something to do with why the Archive moved there and remains there; but remaining there seems justifiable for other reasons.
textfiles 24 days ago
This seems like a lot of zesty made-up assumptions.
And a lot of non-profits would be very very surprised to hear that once you cross the threshold of $9,999,999 costs, you are a business.
- traceroute66 23 days ago
  
  > This seems like a lot of zesty made-up assumptions.
  Nope.
  The second half of my post, anyone who has been seriously involved with large carrier-neutral facilities will likely agree with me.
  It is a fact that IA will be incurring a premium to DIY and as I quite clearly spelt out, I am NOT trying to say they are wrong, I am just genuinely curious as to what the premium they are paying is.
  Regarding my comment about large non-profits. This is from personal experience. Once they get to a certain size, non-profits do switch to a business mentality. You might not like that fact, but it is a fact. They will more often than not have management boards who are "competitively remunerated". They will almost always actively manage their spare cash (of which they will have a large surplus) in investment portfolios. Things will be budgeted and cost-centered just like in larger businesses. They will have in-house legal teams or external teams on retainer to write up philanthropic contracts and aggressively chase after donations people leave them in wills. etc. etc. etc. etc.
  You absolutely cannot place a large non-profit in the same mindset as your local community mom & pop non-profit that operates hand to mouth on a shoestring.
  That is why I discourage people donating to large non-profits. You might feel good donating $100. But in reality it's a sum that wouldn't even be a rounding error on their financial reports. And in the majority of cases most of your donation is more likely to contribute to management expenses than the actual cause.
  Large non-profits are more interested in large corporate philanthropic donations, preferably multi-year agreements. They have more than enough money for the immediate future (<=12–18 months), they want large chunks of future money in the pipeline and that's what the large philanthropic agreements give them.
  

hinkley 25 days ago

I'd like a Public Broadcasting Service for the Internet but I'm afraid that money would just be pulled from actual PBS at this point to support it.

xp84 25 days ago

Too late, PBS is already defunded. CPB was deleted. PBS is now an indie organization without a dime of public money. They should probably rebrand and lose the word “Public”

lazylizard 18 days ago

i dunno why i keep imagining something like ipfs could help something like the internet archive...

skywhopper 25 days ago

Don’t put any stock into the numbers in the article. They are mostly made up out of thin air.