Comment by infogulch
5 months ago
There are many humongous XML sources. E.g. the Wikipedia archive is 42GB of uncompressed text. Holding a fully parsed representation of it in memory would take even more, perhaps even >100GB, which immediately puts documents of this size out of reach.
The obvious solution is streaming, but streaming appears not to be supported, though it is listed under Challenging Future Ideas: https://github.com/Paligo/xee/blob/main/ideas.md
How hard is it to implement XML/XSLT/XPATH streaming?
Anything could be supported with sufficient effort, but streaming hasn't been my priority so far and I haven't explored it in detail. I want to get XSLT 3.0 working properly first.
There's a potential alternative to streaming, though - succinct storage of XML in memory:
https://blog.startifact.com/posts/succinct/
I've built a succinct XML library named Xoz (not integrated into Xee yet):
https://github.com/Paligo/xoz
The parsed in-memory overhead goes down to 20% of the original XML text in my small experiments.
There are a lot of questions about how this functions in the real world, but this library also has very interesting properties like "jump to the descendant with this tag without going through intermediaries".
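For anyone who hasn't run into succinct trees before: the core trick is to encode the tree shape in roughly two bits per node, e.g. as balanced parentheses, and answer navigation queries (parent, next sibling, n-th descendant) with rank/select operations over that bitvector instead of chasing pointers. A toy illustration, not necessarily Xoz's exact encoding:

    <page><title/><text/></page>

    shape as balanced parentheses:  ( ( ) ( ) )
    as bits:                        1 1 0 1 0 0

Three elements cost six bits of tree structure; tag names and text are stored separately in their own compact structures.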
> I want to get XSLT 3.0 working properly first
May I ask why? I used to do a lot of XSLT in 2007-2012 and stuck with XSLT 2.0. I don't know what's in 3.0 as I've never actually tried it, but I never felt there was some feature missing from 2.0 that prevented me from doing something.
As for streaming, an intermediate step would be the ability to cut up a big XML file into smaller ones. A big XML document is almost always the concatenation of smaller files (that's certainly the case for Wikipedia, for example). If one can output smaller files, transform each of them, and then reconstruct the initial big file without ever loading it in full in memory, that should cover a huge proportion of "streaming" needs.
XSLT has been a goal of this project from the start, as my customer uses it. XSLT 3.0 simply because that's the latest specification. What tooling do you use for XSLT 2.0?
0.2x of the original size would certainly make big documents more accessible. I've heard of succinct storage, but not in the context of XML before, thanks for sharing!
I myself actually had no idea succinct data structures existed until last December, but then I found a paper that used them in the context of XML. Just to be clear: it's 120% of the original size; as it stands, this library still uses more memory than the original document, just without a lot of overhead. Normal tree libraries, even if the tree is immutable, store a parent pointer, a first-child pointer, and next- and previous-sibling pointers per node. Even though some nodes can be stored more compactly, it does add up.
I suspect that with the right FM-index, Xoz might be able to store huge documents in less space than the original, but that's an experiment for the future.
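To put rough numbers on the pointer overhead: four 64-bit pointers already cost 32 bytes of tree structure per node, before the node's name, type, or text is stored, so for markup-heavy documents where the average node serializes to fewer bytes than that, the pointers alone can outweigh the original text. (Back-of-the-envelope figures for illustration, not measurements of any particular library.)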
"How hard is it to implement XML/XSLT/XPATH streaming?"
It's actually quite annoying in the general case. It is entirely possible to write an XPath expression whose match of a very early tag depends on an arbitrarily distant later tag.
In another post in this thread I mention how I think it's better to think of it as a multicursor, and this is part of why. XPath doesn't limit itself to just "descending"; you can freely cursor up and down in the document as you write your expression.
So it is easy to write expressions where you literally can't match the first tag, or be sure you shouldn't return it, until the whole document has been parsed.
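A made-up example of the kind of expression I mean:

    /dump/page[not(following::page/@redirect)]

Whether the very first page matches depends on an attribute that might only appear on the very last page, so a streaming evaluator can neither emit nor discard that first page until it has read the whole document.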
I think that, on the grammar side, XPath has made some decisions that make it really hard to implement efficiently in general. About 10 years ago I was looking into binary XML systems and compiling stuff down for embedded systems, and realized that it is really hard to e.g. create efficient transducers (pushdown automata with input and output) for XSLT due to the complexity of XPath.
Streaming is defined in the XSLT 3 spec: https://www.w3.org/TR/xslt-30/#streamability. When you want to use streaming, you are confined to a subset of XPath that is "guaranteed streamable", e.g. you can't just freely navigate the tree anymore. There are some special instructions in XSLT such as <xsl:merge> and <xsl:accumulator> that make it easier to collect your results.
Saxon's paid edition supports it. I've done it a few times, but you have to write your XSLT in a completely different way to make it work.
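To give a flavour of what that different way looks like, here's a minimal sketch (untested, element names assumed) that pulls every page title out of a huge dump while keeping only a small window of the document in memory:

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- every template in this mode must stay inside the guaranteed-streamable subset -->
      <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

      <xsl:template match="/">
        <titles>
          <xsl:apply-templates/>
        </titles>
      </xsl:template>

      <xsl:template match="page">
        <!-- roughly one downward, "consuming" selection per template is the constraint -->
        <xsl:copy-of select="title"/>
      </xsl:template>

    </xsl:stylesheet>

Anything that needs to look around the tree (preceding siblings, revisiting nodes) has to be restructured into accumulators, maps, or locally copied subtrees, which is where the "completely different way" of writing comes from.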
Would it be possible to transform a large XML document into something on-disk that could be queried like a database by the XPath evaluator?
Given the nature of this processing, I think even NVMe-based disk storage would be awfully slow. (People often forget, or never realize, that the "gigabytes per second" that NVMe yields is for sequential access. Random access is quite a bit slower; it still stomps spinning rust, but by much less. And this is going to be a random-access sort of job, so we're in the "several multiples slower than RAM" regime of access.) This sort of thing really wants RAM, and even then, RAM used with an eye towards cache coherency and other such performance considerations.
You'd basically be building an index into each node.
There are some fast databases that store prefix trees, which might actually be suitable for such a task (something like InfinityDB). But building this database will take a while (it requires parsing the entire document). I suppose if reading/querying is going to happen many times, it's worth it?
It seems to me one could replace each text node with an offset into the original document. Perhaps limit that to longer text nodes?
Like MarkLogic?
100GB doesn't sound that out of reach. It's expensive in a laptop, but in a desktop that's about $300 of RAM, and it's supported by many consumer mainboards. Hetzner will rent me a dedicated server with that amount of RAM for $61/month.
If the payloads in question are in that range, the time spent supporting streaming doesn't feel justified compared to just using a machine with more memory. Maybe reducing the size of the parsed representation would be worth it, though, since that benefits nearly every use case.
I just pulled the 100GB number out of nowhere; I have no idea how much overhead parsed XML consumes. It could be less or more than 2.5x (it probably depends on the specific document in question).
In any case, I don't have $1500 to blow on a new computer with 100GB of RAM in the unsubstantiated hope that it happens to fit, just so I can play with the Wikipedia data dump. And I don't think that's a reasonable floor for every person who wants to mess with big XML files.
In the case of Wikipedia dumps there is an easy work-around. The XML dump starts with a small header and "<siteinfo>" section. Then it's just millions of "page" documents for the Wiki pages.
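The overall shape is roughly this (element names from memory, heavily simplified):

    <mediawiki ...>
      <siteinfo> ... </siteinfo>
      <page>
        <title>...</title>
        <revision> ... </revision>
      </page>
      <page> ... </page>
      <!-- millions more pages -->
    </mediawiki>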
You can read the document as a streaming text source and split it into chunks based on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can stream those single-page documents to an XML parser without worrying about document size. This of course doesn't apply in the general case where you are processing arbitrary huge XML documents.
I have processed Wikipedia many times with less than 8 GB of RAM.
Shouldn't parsed XML be smaller than the raw uncompressed text (as you could deduplicate strings)? I'd expect that to be a significant saving for something like Wikipedia in XML.
XML and textual formats in general are ill-suited to such large documents. Step 1 should really be to convert and/or split the file into smaller parts.
I used to work for a NoSQL company whose product was more or less an XQuery engine. We used Wikipedia as a test data set, so the standing joke among those of us dealing with really big volumes was to complain about 'only testing Wikipedia'-sized things. Good times.
StackExchange also; not necessarily streamable, but records are newline-delimited, which makes it easier to sample (at least the last time I worked with the Data Dump).
Is that all in one big document?
We regularly parse ~1GB XML documents at work, and got laughed at by someone I know who worked with bulk invoices when I called it a large XML file.
Not sure how common 100GB files are, but I can certainly imagine that being the norm in certain niches.