Comment by nicoburns
10 months ago
Shouldn't parsed XML be smaller than the raw uncompressed text, since you could deduplicate strings? I'd expect that to be a significant saving for something like Wikipedia in XML.
10 months ago
> Shouldn't parsed XML be smaller than the raw uncompressed text? (as you could deduplicate strings). I'd expect that to be a significant saving for something like wikipedia in XML
For Wikipedia, the bulk of the data in the XML is inside a "<text>" block that contains wikitext: https://en.wikipedia.org/wiki/Help:Wikitext
In the English Wikipedia, the wikitext accounts for about 80% of the bytes of the decompressed XML dump, so deduplicating tag and attribute strings can only shrink the remaining ~20%.
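If you want to check that share on a dump yourself, here is a rough sketch using Python's standard library. The function name `wikitext_fraction` and the tiny `SAMPLE` document are made up for illustration, and the count is approximate: the parser hands back unescaped text, so entities like `&amp;` are counted as one byte rather than five.

```python
import io
import xml.etree.ElementTree as ET


def wikitext_fraction(xml_bytes: bytes) -> float:
    """Approximate share of the document's bytes that sit inside <text> elements."""
    text_bytes = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        # Real dumps put elements in an XML namespace, so compare only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            text_bytes += len((elem.text or "").encode("utf-8"))
        # Drop the element's content once it has been counted to keep memory flat.
        elem.clear()
    return text_bytes / len(xml_bytes)


# Hypothetical miniature "dump": one page whose <text> holds 80 bytes of content.
SAMPLE = (
    b"<page><title>A</title><revision><text>"
    + b"x" * 80
    + b"</text></revision></page>"
)

if __name__ == "__main__":
    share = wikitext_fraction(SAMPLE)
    print(f"wikitext share: {share:.1%}")
    # Even if everything outside <text> cost nothing after parsing,
    # the parsed form could not drop below this share of the original size.
    print(f"lower bound on parsed size: {share * len(SAMPLE):.0f} of {len(SAMPLE)} bytes")
```

For a real dump you would stream from the file instead of holding the bytes in memory; the point is only that whatever share this reports is a floor on the parsed size, unless the wikitext itself is compressed or deduplicated.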