Comment by infogulch
5 months ago
0.2x of the original size would certainly make big documents more accessible. I've heard of succinct storage, but not in the context of xml before, thanks for sharing!
I had no idea succinct data structures existed until last December, when I found a paper that used them in the context of XML. Just to be clear: it's 120% of the original size. As it stands, this library still uses more memory than the original document, just without much overhead. Normal tree libraries, even for immutable trees, store a parent pointer, a first-child pointer, and next- and previous-sibling pointers per node. Even though some nodes can be stored more compactly, it adds up.
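To put rough numbers on the per-node pointer cost, here is a minimal sketch. It assumes 64-bit pointers and uses the balanced-parentheses encoding (about 2 bits per node) as the succinct comparison. That encoding is a well-known textbook choice, not necessarily what Xoz uses, and the function names are made up for illustration. Names, attributes, and text storage are ignored on both sides.

```python
POINTER_BYTES = 8  # assumption: 64-bit pointers


def pointer_tree_bytes(node_count, pointers_per_node=4):
    """Bytes spent on parent / first-child / next / previous pointers alone."""
    return node_count * pointers_per_node * POINTER_BYTES


def balanced_parens_bytes(node_count):
    """Bytes for a 2-bits-per-node encoding, rounded up.

    Ignores the small auxiliary index a real succinct structure
    needs for fast navigation.
    """
    return (2 * node_count + 7) // 8


def encode_balanced_parens(node):
    """Depth-first encoding: 1 when a node opens, 0 when it closes.

    `node` is a hypothetical (name, children) tuple.
    """
    _name, children = node
    bits = [1]
    for child in children:
        bits.extend(encode_balanced_parens(child))
    bits.append(0)
    return bits


tree = ("a", [("b", []), ("c", [("d", [])])])
print(encode_balanced_parens(tree))      # [1, 1, 0, 1, 1, 0, 0, 0]
print(pointer_tree_bytes(1_000_000))     # 32000000
print(balanced_parens_bytes(1_000_000))  # 250000
```

For a million nodes that is about 32 MB of pointers versus roughly 250 KB of structure bits, before counting any of the actual document content.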
I suspect that with the right FM-index, Xoz might be able to store huge documents in less space than the original, but that's an experiment for the future.
Would you be able to parse it in a streaming fashion and just store the structure of the document in memory, with just offsets for all of the string locations, and then re-read those from disk as needed?
With modern SSDs and disk cache, that's likely to perform well enough without having to hold the whole document in memory at once.
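A rough sketch of that idea, assuming a toy scanner rather than a real XML parser: it reads the file once in chunks, keeps only `(offset, length)` pairs for each text run in memory, and seeks back into the file when a string is actually needed. It treats every `<` and `>` as a tag delimiter, so comments, CDATA, and entities are not handled. `index_text_spans` and `read_span` are hypothetical names, not part of any library.

```python
def index_text_spans(path, chunk_size=64 * 1024):
    """Scan the file in chunks and return (offset, length) for each text run.

    Toy scanner: assumes '<' and '>' appear only as tag delimiters.
    """
    spans = []
    pos = 0            # absolute byte offset of the byte being examined
    in_tag = False
    text_start = None  # absolute offset where the current text run began
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for byte in chunk:
                if in_tag:
                    if byte == 0x3E:  # '>'
                        in_tag = False
                elif byte == 0x3C:    # '<'
                    if text_start is not None:
                        spans.append((text_start, pos - text_start))
                        text_start = None
                    in_tag = True
                elif text_start is None:
                    text_start = pos
                pos += 1
    if text_start is not None:
        spans.append((text_start, pos - text_start))
    return spans


def read_span(path, span):
    """Re-read one string from disk on demand."""
    offset, length = span
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")
```

Offsets are byte offsets, so multi-byte UTF-8 text still round-trips. A real implementation would keep the file handle open (or memory-map the file) instead of reopening it per lookup.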