Comment by nicoburns
5 months ago
Shouldn't parsed XML be smaller than the raw uncompressed text, since you could deduplicate strings? I'd expect that to be a significant saving for something like Wikipedia in XML.
5 months ago
For Wikipedia, the bulk of the data in the XML is inside a "<text>" block that contains wikitext: https://en.wikipedia.org/wiki/Help:Wikitext
In the English Wikipedia, the wikitext accounts for about 80% of the bytes of the decompressed XML dump.
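If you want to check that kind of ratio yourself, here is a rough Python sketch that streams a document and counts how many bytes sit inside "<text>" elements. The SAMPLE string and the text_fraction name are made up for illustration; the real dump has more elements and an XML namespace on its tags, so the tag comparison would need adjusting. It also counts decoded characters, so escaped entities make it a slight undercount.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump, NOT the real export schema (which is namespaced
# and has many more elements per page).
SAMPLE = (
    "<mediawiki>"
    "<page><title>A</title><revision>"
    "<text>'''A''' is a [[letter]].</text></revision></page>"
    "<page><title>B</title><revision>"
    "<text>'''B''' follows [[A]].</text></revision></page>"
    "</mediawiki>"
)


def text_fraction(xml_source: str) -> float:
    """Return the share of the document's UTF-8 bytes inside <text> elements."""
    raw = xml_source.encode("utf-8")
    in_text = 0
    # iterparse streams the document, so a large file need not fit in memory.
    for _event, elem in ET.iterparse(io.BytesIO(raw), events=("end",)):
        if elem.tag == "text" and elem.text:
            in_text += len(elem.text.encode("utf-8"))
        # Drop the element's content once it has been counted.
        elem.clear()
    return in_text / len(raw)


if __name__ == "__main__":
    print(f"{text_fraction(SAMPLE):.0%} of bytes are inside <text>")
```

Assuming the wikitext itself is mostly unique per article (my assumption, not something measured here), deduplicating tag names and other repeated strings can only touch the remaining ~20%, so that bounds the saving from a parsed representation.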