Comment by jsheard
15 hours ago
Does Wikipedia really need to outsource this? They already do basically everything else in-house, even running their own CDN on bare metal; I'm sure they could spin up an archiver which could be implicitly trusted. Bypassing paywalls would be playing with fire, though.
> Does Wikipedia really need to outsource this?
I hope so. Archiving is a legal landmine.
Archive.org is the archiver; rotted links are replaced with Archive.org links by a bot:
https://meta.wikimedia.org/wiki/InternetArchiveBot
https://github.com/internetarchive/internetarchivebot
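Roughly what the lookup half of such a bot does, as a minimal sketch against the public Wayback availability API (this is illustrative, not IABot's actual code):

```python
import requests

def closest_snapshot(url: str) -> str | None:
    """Query the Wayback Machine availability API for the nearest
    archived copy of `url`; returns the snapshot URL or None."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# A rotted citation would then be swapped for its archived copy:
print(closest_snapshot("http://example.com/dead-page"))
```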
Yeah, for historical links it makes sense to fall back on IA's existing archives, but going forward Wikipedia could take its own snapshots of cited pages and substitute them in if/when the original rots. It would be more reliable than hoping IA grabbed it.
Not opposed. Wikimedia tech folks are very accessible in my experience; ask them to make a GET or POST to https://web.archive.org/save whenever a link is added via the wiki editing mechanism. Easy peasy. Example CLI tools are https://github.com/palewire/savepagenow and https://github.com/akamhy/waybackpy
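Triggering a capture really is roughly one request. A minimal sketch using the unauthenticated GET form of Save Page Now (archive.org also offers an authenticated SPN2 API with its own rate limits; the libraries above wrap all of this):

```python
import requests

def save_page_now(url: str) -> str:
    """Ask the Wayback Machine to capture `url` by requesting
    https://web.archive.org/save/<url>. The response typically
    redirects to the fresh snapshot, so we return the final URL."""
    resp = requests.get(
        f"https://web.archive.org/save/{url}",
        timeout=120,  # captures can take a while to complete
        allow_redirects=True,
    )
    resp.raise_for_status()
    return resp.url  # e.g. https://web.archive.org/web/<timestamp>/<url>

print(save_page_now("https://example.com/cited-article"))
```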
A shortcut is to consume the Wikimedia changelog firehose and make these HTTP requests yourself, performing a CDX lookup to see whether a recent snapshot was already taken before issuing a capture request (to be polite to the capture worker queue).
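A sketch of that pipeline, assuming the public EventStreams recentchange feed and the Wayback CDX API (both real endpoints). Extracting the newly added external links from an edit would need an extra revision-diff step, which is stubbed out here:

```python
import json
import requests

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"
CDX = "https://web.archive.org/cdx/search/cdx"

def recently_archived(url: str, since: str = "20240101") -> bool:
    """CDX lookup: is there already a snapshot newer than `since`
    (YYYYMMDD)? If so, skip the capture request to spare the
    capture worker queue."""
    resp = requests.get(
        CDX,
        params={"url": url, "from": since, "limit": "1", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json()) > 1  # row 0 is the CDX field header

def extract_new_external_links(event: dict) -> list[str]:
    """Stub: a real bot would fetch the diff between
    event["revision"]["old"] and event["revision"]["new"] and
    parse out any newly added external links."""
    return []

# Consume the recent-changes firehose as server-sent events.
with requests.get(STREAM, stream=True, timeout=60) as stream:
    for line in stream.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        for url in extract_new_external_links(event):
            if not recently_archived(url):
                requests.get(f"https://web.archive.org/save/{url}", timeout=120)
```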
Archive.org are left-wing activists who will agree to censor anything that other left-wing activists or large companies don't want online.
Like what?
Of course they do. If Wikipedia did it themselves they'd immediately get DMCA'd and sued into oblivion.
> Bypassing paywalls would be playing with fire though.
That's the only reason archive.today was used. For non-paywalled stuff you can use the Wayback Machine.