Comment by infogulch
10 months ago
0.2x of the original size would certainly make big documents more accessible. I've heard of succinct storage, but not in the context of xml before, thanks for sharing!
I myself actually had no idea succinct data structures existed until last December, but then I found a paper that used them in the context of XML. Just to be clear: it's 120% of the original size (1.2x, not 0.2x). As it stands, this library still uses more memory than the original document, just without a lot of overhead. Normal tree libraries, even if the tree is immutable, take a parent pointer, a first child pointer, and next and previous sibling pointers per node. Even though some nodes can be stored more compactly, it does add up.
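To put rough numbers on that pointer overhead, here is a back-of-the-envelope sketch in Python. It assumes 64-bit pointers and compares four pointers per node against balanced parentheses, a textbook succinct tree encoding that spends 2 bits per node. This is not a claim about how Xoz lays out its data; the function names are made up for illustration, and a real succinct tree also needs small rank/select indexes plus storage for tag names and text.

```python
POINTER_BYTES = 8  # assumption: 64-bit pointers


def pointer_tree_bytes(node_count, pointers_per_node=4):
    """Bytes spent on parent, first-child, next- and previous-sibling pointers."""
    return node_count * pointers_per_node * POINTER_BYTES


def to_balanced_parens(tree):
    """Encode a (name, children) tree as bits: 1 opens a node, 0 closes it."""
    _name, children = tree
    bits = [1]
    for child in children:
        bits.extend(to_balanced_parens(child))
    bits.append(0)
    return bits


def parens_bytes(node_count):
    """Bytes needed for the 2-bits-per-node structure, rounded up."""
    return (2 * node_count + 7) // 8


if __name__ == "__main__":
    tree = ("doc", [("title", []), ("p", [("b", [])])])
    print(to_balanced_parens(tree))
    print(pointer_tree_bytes(1_000_000), "bytes of pointers")
    print(parens_bytes(1_000_000), "bytes of parentheses")
```

For a million nodes that is 32,000,000 bytes of pointers against 250,000 bytes of parentheses, before counting any of the actual document content.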
I suspect that with the right FM-Index, Xoz might be able to store huge documents in a smaller size than the original, but that's an experiment for the future.
Would you be able to parse it in a streaming fashion and just store the structure of the document in memory, with just offsets for all of the string locations, and then re-read those from disk as needed?
With modern SSDs and disk cache, that's likely enough to be plenty performant without having to store the whole document in memory at once.
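As a rough illustration of that idea (not how Xoz works), here is a minimal Python sketch using the standard library's expat bindings. It streams through a file, keeps only the element structure plus byte offsets for each text run, and re-reads the text from disk on demand. `index_text_offsets` and `read_span` are hypothetical names, and the sketch assumes a document without comments, processing instructions, or CDATA sections between text and tags. The bytes it reads back are the raw file content, so entity references such as `&amp;` come back undecoded.

```python
import os
import tempfile
import xml.parsers.expat


def index_text_offsets(path):
    """Stream-parse `path`; return (structure, spans).

    structure: list of ("open", name), ("close", name), ("text", span_id)
    spans: list of (start, end) byte offsets of raw text runs in the file
    """
    structure = []
    spans = []
    text_start = None  # byte offset where the current text run began

    parser = xml.parsers.expat.ParserCreate()

    def flush_text():
        nonlocal text_start
        if text_start is not None:
            # The text run ends where the current tag begins.
            spans.append((text_start, parser.CurrentByteIndex))
            structure.append(("text", len(spans) - 1))
            text_start = None

    def on_start(name, attrs):
        flush_text()
        structure.append(("open", name))

    def on_end(name):
        flush_text()
        structure.append(("close", name))

    def on_chars(data):
        nonlocal text_start
        # Only the offset of the first chunk matters; the text is discarded.
        if text_start is None:
            text_start = parser.CurrentByteIndex

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_chars

    with open(path, "rb") as f:
        parser.ParseFile(f)
    return structure, spans


def read_span(path, span):
    """Re-read one text run from disk using its stored byte offsets."""
    start, end = span
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "wb") as f:
        f.write(b"<doc><title>Hello</title><p>big &amp; small</p></doc>")
    structure, spans = index_text_offsets(path)
    for event in structure:
        if event[0] == "text":
            print("text", read_span(path, spans[event[1]]))
        else:
            print(*event)
    os.remove(path)
```

Only the structure and a pair of integers per text run stay in memory; everything else is left to the file system and the disk cache.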