Comment by infogulch
5 months ago
I just pulled the 100GB number out of nowhere; I have no idea how much overhead parsed XML consumes. It could be less or more than 2.5x (it probably depends on the specific document in question).
In any case I don't have $1500 to blow on a new computer with 100GB of RAM in the unsubstantiated hope that it happens to fit, just so I can play with the Wikipedia data dump. And I don't think that's a reasonable floor for everyone who wants to mess with big XML files.
In the case of Wikipedia dumps there is an easy workaround. The XML dump starts with a small header and "<siteinfo>" section; after that it's just millions of "<page>" elements, one per wiki page.
You can read the document as a streaming text source and split it into chunks based on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can stream those single-page documents to an XML parser without worrying about document size. This of course doesn't apply in the general case where you are processing arbitrarily huge XML documents.
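Here is a minimal sketch of that split-then-parse approach in Python. The file name is hypothetical, and it assumes "<page>" and "</page>" always sit on their own lines (true for the pretty-printed dumps, but a simplification in general):

    import xml.etree.ElementTree as ET

    def iter_pages(path):
        """Yield one parsed <page> element at a time from a decompressed dump."""
        buf = []
        inside = False
        with open(path, encoding="utf-8") as f:
            for line in f:
                if "<page>" in line:
                    inside = True          # state machine: entering a page
                if inside:
                    buf.append(line)
                if "</page>" in line:
                    inside = False         # leaving the page: parse the buffered chunk
                    yield ET.fromstring("".join(buf))
                    buf.clear()

    # Hypothetical file name; pages are processed one at a time.
    for page in iter_pages("enwiki-latest-pages-articles.xml"):
        title = page.findtext("title")
        wikitext = page.findtext("revision/text") or ""

Because each yielded chunk is a small, well-formed document, any ordinary XML parser can handle it, and peak memory is bounded by the largest single page rather than by the whole dump.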
I have processed Wikipedia many times with less than 8 GB of RAM.
Shouldn't parsed XML be smaller than the raw uncompressed text (since you could deduplicate strings)? I'd expect that to be a significant saving for something like Wikipedia in XML.
For Wikipedia, the bulk of the data in the XML is inside a "<text>" block that contains wikitext: https://en.wikipedia.org/wiki/Help:Wikitext
In the English Wikipedia the wikitext accounts for about 80% of the bytes of the decompressed XML dump.
XML, and textual formats in general, are ill-suited to such large documents. Step 1 should really be to convert and/or split the file into smaller parts.
Or shred it into a proper database designed for that.
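For what it's worth, a minimal sketch of that idea using nothing but the standard library: stream the dump with ElementTree's iterparse and insert each page into SQLite, so later passes query the database instead of re-parsing XML. The file name, namespace version, and table layout here are all assumptions.

    import sqlite3
    import xml.etree.ElementTree as ET

    # The export namespace version varies between dumps; adjust to match the file.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    conn = sqlite3.connect("wiki.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, wikitext TEXT)")

    with conn:  # one transaction for the whole load; commit in batches if preferred
        context = ET.iterparse("enwiki-latest-pages-articles.xml",
                               events=("start", "end"))
        _, root = next(context)  # the <mediawiki> root element
        for event, elem in context:
            if event == "end" and elem.tag == NS + "page":
                conn.execute(
                    "INSERT INTO pages VALUES (?, ?)",
                    (elem.findtext(NS + "title"),
                     elem.findtext(NS + "revision/" + NS + "text") or ""),
                )
                root.clear()  # drop processed pages so memory stays bounded
    conn.close()

Once the pages are in a database, the size of the parse tree stops mattering: you query whatever slice of pages you need.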