Wikipedia: Database Download

7 months ago (en.wikipedia.org)

Might also be worthwhile to download a pre-~2023 dump, for the same reason low-background steel is prized: it predates widespread contamination, in this case by AI-generated text.

I wonder how easy it would be to make a practically indestructible, everlasting Wikipedia reader.

Something using solar for power, with a rugged and water-resistant enclosure, made of extremely high-quality components that won't break for hundreds of years at least. Maybe add an IrDA port for good measure, to make it possible to transfer all the data out somewhat quickly.

You could make hundreds of these and put them in hard-to-reach locations around the world, to make sure at least one survives whatever calamity might befall us in the future.
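
Back-of-the-envelope, assuming the ~110 GB English ZIM mentioned below and the standard IrDA rates (4 Mbit/s FIR, 16 Mbit/s VFIR), pulling a full copy off the device over infrared would take on the order of days rather than minutes:

    # Rough transfer-time estimate; the size figure is the ~110 GB ZIM
    # mentioned elsewhere in the thread, not a measured value.
    size_bytes = 110e9
    for name, bits_per_s in [("IrDA FIR, 4 Mbit/s", 4e6),
                             ("IrDA VFIR, 16 Mbit/s", 16e6)]:
        hours = size_bytes * 8 / bits_per_s / 3600
        print(f"{name}: {hours:,.0f} hours ({hours / 24:.1f} days)")
    # ~61 hours at 4 Mbit/s, ~15 hours at 16 Mbit/s.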

  • Kiwix has created pretty polished software for this: https://kiwix.org/

    My last download of English Wikipedia was ~110 GB and includes images! It's impressively small for the volume of information available.

  • You could even make it radiation tolerant by printing it.

  • Aard2 for Android has existed since at least 2015:

    https://f-droid.org/packages/itkach.aard2

    I have many current and old dumps and can switch between a few years. Very nice in case of deleted articles, or to check old, time-stamped versions. It also supports more than just Wikipedia, like Wikiquote, Wikivoyage, or the cooking wiki. You can compile your own MediaWikis too.

  • Some thoughts about making it possible for individual humans to access Wikipedia in a way that is robust to calamities within the sphere of human agency.

    Seems like you would want it to be stored digitally. Ideally, people would have the ability to access it remotely, in case their local copy is somehow corrupted. For that, you would need a physical network by which the data can be transmitted. Economies of scale would seem to suggest that there would be one or a few entities that would “serve” the content to individuals who request it. Of course, you would want those individuals to be able to access this information without having detailed technical knowledge and ability. I guess they would have pre-packaged software “browsers” they could use to access the network.

    In order to maintain this arrangement, you would want enough political stability to allow for the physical upkeep of this infrastructure, including human infrastructure (feeding the engineers who make it all possible). In order to make it worthwhile, you would need people who want to access the information too. I suspect political stability, a sufficient abundance of the necessities for human life, and the political will to make sure that everyone’s needs are met so that they can safely be curious about the world would help here too.

    All of this requires sources of power. I suspect that a combination of nuclear power, solar/batteries, and geothermal energy would be sufficient and would avoid the problem of running out of fossil fuels at some point in the future. A nice side effect here is reducing the impact of calamities exacerbated by the greenhouse effect.

    For the information to continue being relevant, you would have to update it with new knowledge, and correct inaccuracies. How best to accomplish this? Well, I guess you would need a systematic way to interrogate the causes behind the various effects we observe in the world. I would propose a system where people create hypotheses, and perform experiments that exclude the influence of as many factors as possible external to the phenomenon being studied. People would then share their findings, and I guess would critique each other’s arguments in a sort of “peer review” to try to come to a consensus. You would have to feed and provide for these people at a certain basic level to make sure they are comfortable and safe enough to continue doing this work. I guess you would want to encourage the value systems compatible with this method of interrogating the world.

    Just my 2 cents.

You can also get it as a .zim file for easy offline browsing with Kiwix.

The whole enchilada: https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_ma...

Other versions: https://library.kiwix.org/#lang=eng&category=wikipedia
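
If you'd rather poke at a ZIM programmatically than through the Kiwix apps, openZIM also publishes Python bindings. A minimal sketch, assuming the python-libzim package and a "maxi" ZIM with a full-text index (names follow its README and may shift between versions; the search term is just an example):

    from libzim.reader import Archive
    from libzim.search import Query, Searcher

    zim = Archive("wikipedia_en_all_maxi.zim")   # whichever ZIM you downloaded
    print(zim.entry_count, "entries; main page:", zim.main_entry.get_item().path)

    # Full-text search, then pull the matching articles' HTML out of the archive.
    search = Searcher(zim).search(Query().set_query("low-background steel"))
    for path in search.getResults(0, 3):                  # top three hits
        entry = zim.get_entry_by_path(path)
        html = bytes(entry.get_item().content).decode("utf-8")
        print(entry.title, "-", len(html), "bytes of HTML")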

  • It is useful software for offline use and emergencies. For those who may not know, apart from wikis, they also offer offline documentation (the ArchWiki and other Linux distro wikis, programming library docs, etc.), medical references (MedlinePlus, etc.), and Stack Exchange.

  • Will the 2025 zim be available as well?

    • Main Kiwix dev in charge of scrapers here (the tools that create ZIM files, even if, technically speaking, we do not really scrape).

      We are working hard toward upgrading the Wikipedia ZIMs, but it is far from an easy feat. I'm mostly working on this solo, and far from dedicating 100% of my time to it, so it does not move very fast. We are quite close to the goal, however; probably only a matter of weeks now.

      Bonus: the tool is now getting pretty good at making a ZIM of any MediaWiki, not only the Wikimedia ones. We expect, for instance, to work on all the Fandom wikis sometime this year, since there is significant knowledge over there.

      2 replies →

    • I am wondering the same thing. I have Jan 2021, Jan 2024... I want to keep a snapshot each year and I wonder why a new one hasn't been generated.

      I haven't looked for documentation on creating my own zim file.

      1 reply →
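
      (In case it helps anyone who does want to build their own: openZIM's python-libzim includes a writer as well. A rough sketch adapted from its README example; treat the names as approximate, since the API has changed between versions.)

          from libzim.writer import Creator, Item, StringProvider, Hint

          class PageItem(Item):
              """One HTML page to store in the ZIM."""
              def __init__(self, path, title, html):
                  super().__init__()
                  self.path, self.title, self.html = path, title, html
              def get_path(self): return self.path
              def get_title(self): return self.title
              def get_mimetype(self): return "text/html"
              def get_contentprovider(self): return StringProvider(self.html)
              def get_hints(self): return {Hint.FRONT_ARTICLE: True}

          with Creator("snapshot.zim").config_indexing(True, "eng") as creator:
              creator.set_mainpath("home")
              creator.add_item(PageItem("home", "Home", "<html><body>Hello</body></html>"))
              for name, value in {"Title": "My snapshot", "Language": "eng",
                                  "Creator": "me", "Publisher": "me",
                                  "Description": "A personal wiki snapshot"}.items():
                  creator.add_metadata(name, value)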

  • I tried Kiwix the other day; it has a ~300 MB "essentials" text-only version that was interesting.

    This comment was downvoted; wouldn't it merit a reply explaining why it wasn't seen as contributing to the discussion?

    • > I tried Kiwix the other day; it has a ~300 MB "essentials" text-only version that was interesting.

      I didn't downvote the comment, but it's not an incredibly deep contribution, is it?

      If you really wish to contribute, perhaps you can say what the "'essentials' text version" contained and why you found it interesting?

If you've got some spare bandwidth & storage then seeding some of the torrents here is a cheap and fun way of helping Wikipedia out. I've served around 20TB of these dumps in the past year.

https://meta.wikimedia.org/wiki/Data_dump_torrents

  • Do you happen to know why Wikipedia didn't embrace torrents as the default download method?

    • Speculating: because torrents are not especially good at dealing with small modifications?

      Most people probably won't seed many versions, so it's a losing effort, and you need to allocate a huge chunk of space for each version.

      Deduplicating filesystems are sadly not in vogue.

I more or less do this every year: grab the latest Kiwix English Wikipedia ZIM (about 100 GB or so). I keep the older ones as well.

Is there a RAG for Wikipedia?

I may not be using the term correctly here. In short, I would love a local LLM + Wikipedia snapshot so that I can have an offline, self-hosted ... Hitchhiker's Guide to Earth.
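
In its simplest form that is just embeddings, nearest-neighbour search, and a prompt to a local model. A minimal sketch, assuming the articles have been extracted to plain-text files in a wiki_txt/ directory and that generate() is whatever local LLM hook you use (llama.cpp, Ollama, etc.); the directory, the embedding model, and generate() are placeholders, not part of any existing tool:

    from pathlib import Path
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")          # small, CPU-friendly

    # Embed each article once. For all of Wikipedia you would chunk articles
    # and use a vector index (e.g. FAISS) rather than a flat numpy matrix.
    docs = {p.stem: p.read_text(errors="ignore")[:2000]
            for p in Path("wiki_txt").glob("*.txt")}
    names = list(docs)
    vectors = embedder.encode([docs[n] for n in names], normalize_embeddings=True)

    def retrieve(question, k=3):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(vectors @ q)[::-1][:k]                 # cosine similarity
        return [docs[names[i]] for i in top]

    def answer(question):
        context = "\n\n".join(retrieve(question))
        prompt = f"Using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
        return generate(prompt)   # your local LLM call goes here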

There are non-English versions of Wikipedia as well.

Can anyone please point to information on how we can download a copy of one specific language version?
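
If it helps: Wikimedia's per-language XML dumps follow a predictable URL pattern based on the language code, and the Kiwix ZIMs under https://download.kiwix.org/zim/wikipedia/ carry the language code in the filename (exact names include a date, so check the directory listing). A small sketch for the Wikimedia dump, using "de" purely as an example code:

    import urllib.request

    lang = "de"   # ISO language code of the edition you want: "fr", "hi", "sw", ...
    name = f"{lang}wiki-latest-pages-articles.xml.bz2"
    url = f"https://dumps.wikimedia.org/{lang}wiki/latest/{name}"
    print("Fetching", url)
    urllib.request.urlretrieve(url, name)   # current articles only, no history or media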

Okay, that looked a bit ridiculous in the pre-AI era (who needs to download the whole Wikipedia?), but now I can see the sense in it.

  • Too bad the AI scrapers don't care, and are melting Wikipedia's production servers anyway.

    https://arstechnica.com/information-technology/2025/04/ai-bo...

    • I bet someone like Cloudflare could pull the dataset each day and serve up a plain-text/Markdown version of Wikipedia for rounding-error levels of spend. I just loaded a random Wikipedia page and it had a total weight of 1.5 MB, for what I worked out would be about 30 KB of Markdown (i.e. 50x less bandwidth).

      Of course, the problem then is getting all these scrapers and bots to actually use the alternative, but Wikimedia could potentially redirect suspected clients in that direction.
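
      (A rough way to reproduce that estimate, assuming the html2text package; note the 1.5 MB above is full page weight including images/CSS/JS, so comparing the raw HTML alone gives a smaller ratio.)

          import urllib.request
          import html2text

          req = urllib.request.Request("https://en.wikipedia.org/wiki/Low-background_steel",
                                       headers={"User-Agent": "markdown-size-check/0.1"})
          html = urllib.request.urlopen(req).read().decode("utf-8")

          converter = html2text.HTML2Text()
          converter.ignore_links = True
          markdown = converter.handle(html)

          print(f"HTML {len(html) / 1024:.0f} KiB -> "
                f"Markdown {len(markdown.encode()) / 1024:.0f} KiB, "
                f"~{len(html) / len(markdown.encode()):.0f}x smaller")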

      6 replies →

    • I wonder if Wikipedia's recent switch to client-side rendering has hurt their performance too. Serving a prerendered page might have helped in this situation. I don't know the details of their new system, though.

    • Tragedy of the commons. And that’s why we can’t have nice things.

      Because people are people. And will always prioritize egotism over respect for the common good.

      2 replies →

  • > who needs to download the whole Wikipedia

    Anyone archiving the site. Wikipedia is, for its faults, one of the best-curated collections of summarized human knowledge, probably in history.

    Replicating that knowledge helps build data resilience and protect it against all sorts of disasters. I used to seed their monthly data dump torrent for a while.

  • Oh no, even now there is plenty of use for it outside of AI training. Just think of all the schools in villages around the world that don't have internet access or have a very limited connection. I've worked with folks who would set up local "Wikipedia servers" for schools so that kids could access Wikipedia over the local network. In other setups they just download all of Wikipedia to a set of laptops and use one of the offline readers to browse it.

    This is essentially the modern version of having a library of encyclopedias.

    • I'm thinking less about AI training and more about having a source of (reasonably) reliable information from the net, in case AI generated fake images and generated cross referenced texts start making it too difficult to discern real history from malicious rewrites. It's bad enough now, but can get much worse with the proliferation of AI agents.

      2 replies →

  • Also helps save Wikipedia if it gets shut down - which might happen!

    • True. Musk, for example, is publicly attacking it for spreading "left-wing lies" because his Wikipedia page contains statements like "He has been criticized for making unscientific and misleading statements, including COVID-19 misinformation and promoting conspiracy theories, and affirming antisemitic, racist, and transphobic comments", which are just pure facts.

      It would be nice to have something like this more decentralized.

      7 replies →

  • No idea when your pre-AI era began, but I was much more excited to host Wikipedia locally 15 years ago than I am now.

  • > who needs to download the whole Wikipedia

    Anyone who wants access while offline, for whatever reason. This ranges from simply saving costs, via the more complicated case of accessing content from regions with spotty and/or expensive connectivity (you're on a ship out of reach of shore-based mobile networks and without access to Starlink or something similar, you're deep in the jungle, deep underground, etc.), to some prepper scenario where connectivity ends at the cave entrance because the 'net has ceased to exist.

    I would like to have a less politically biased online encyclopedia for the latter scenario; it would be a shame to start a new society based on the same bad ideas which brought down the previous one. If a politically neutral LLM ever becomes available, that'd be one of the first tasks I'd put it to: point out bias - any bias - in articles, encyclopedias and other 'sources' of knowledge (yes, I know, WP is not an original source, but for this purpose it is).

    • You don't need to be deep in the jungle. You might just not want to pay for mobile data. If your phone has an SD card slot, you can put in 1 TB of storage and have Wikipedia, a lifetime of music, tons of books, an atlas of your country for GPS navigation, and plenty of room for taking photos/videos. Storage is cheap enough that mobile data should be basically pointless.

    • Genuine question: can you provide multiple explicit examples of such bias? I've heard a lot of people railing against bias in Wikipedia, but no one provides any blatant examples of it.

      3 replies →

    • > based on the same bad ideas which brought down the previous one

      I don’t think that’s fair. Not that Wikipedia is without bias, but that their ivory tower biases are worlds apart from the lying brutal animalistic Hollywood signals herding the masses in “our democracy”.