Comment by wongarsu

5 months ago

100GB doesn't sound that out of reach. It's expensive in a laptop, but in a desktop that's about $300 of RAM and is supported by many consumer mainboards. Hetzner will rent me a dedicated server with that much RAM for $61/month.

If the payloads in question are in that range, the time spent supporting streaming doesn't feel justified compared to just using a machine with more memory. Maybe reducing the size of the parsed representation would be worth it though, since that benefits nearly every use case.

I just pulled the 100GB number out of nowhere; I have no idea how much overhead parsed XML consumes. It could be less or more than 2.5x (it probably depends on the specific document in question).

In any case, I don't have $1500 to blow on a new computer with 100GB of RAM in the unsubstantiated hope that the data happens to fit, just so I can play with the Wikipedia data dump. And I don't think that's a reasonable floor for everyone who wants to mess with big XML files.

  • In the case of Wikipedia dumps there is an easy workaround. The XML dump starts with a small header and "<siteinfo>" section; after that it's just millions of "<page>" elements, one per wiki page.

    You can read the document as a streaming text source and split it into chunks based on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can feed those single-page documents to an XML parser without worrying about document size (see the sketch after these replies). This of course doesn't apply in the general case where you are processing arbitrary huge XML documents.

    I have processed Wikipedia many times with less than 8 GB of RAM.

  • Shouldn't parsed XML be smaller than the raw uncompressed text, since you could deduplicate strings? I'd expect that to be a significant saving for something like Wikipedia in XML.

  • XML and textual formats in general are ill-suited to such large documents. Step 1 should really be to convert and/or split the file into smaller parts.
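
For reference, here is a minimal sketch of the page-splitting approach described in the first reply, in Python. It assumes a bzip2-compressed dump (the file name below is a placeholder) and relies on the dump pretty-printing "<page>" and "</page>" on their own lines; it is not a general-purpose XML splitter.

    import bz2
    import xml.etree.ElementTree as ET

    # Placeholder path; substitute the actual dump file.
    DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"

    def iter_pages(path):
        # Yield one <page>...</page> chunk at a time.
        # Simple state machine: buffer lines between a line containing
        # "<page>" and the matching "</page>", then yield the chunk as a
        # small, self-contained XML document.
        buffer = None
        with bz2.open(path, mode="rt", encoding="utf-8") as stream:
            for line in stream:
                if buffer is None:
                    if "<page>" in line:
                        buffer = [line]          # start of a new page
                else:
                    buffer.append(line)
                    if "</page>" in line:        # end of the current page
                        yield "".join(buffer)
                        buffer = None

    if __name__ == "__main__":
        for chunk in iter_pages(DUMP_PATH):
            page = ET.fromstring(chunk)          # parse one small document
            print(page.findtext("title"))

Peak memory is bounded by the largest single page rather than the whole dump, which is why this kind of approach fits comfortably in a few GB of RAM.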