Comment by jerf
5 months ago
"How hard is it to implement XML/XSLT/XPATH streaming?"
It's actually quite annoying in the general case. It is completely possible to write an XPath expression whose match of a very early tag depends on an arbitrarily distant later tag.
In another post in this thread I mention how I think it's better to think of it as a multicursor, and this is part of why. XPath doesn't limit itself to just "descending", you can freely cursor up and down in the document as you write your expression.
So it is easy to write expressions where you literally can't match the first tag, or be sure you shouldn't return it, until the whole document has been parsed.
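To make that concrete, here's a rough Python sketch (not XSLT — just an analogy using the stdlib's streaming parser) of an XPath-like query whose predicate on early elements depends on the end of the document. The document and tag names are made up for illustration; the point is that every `<item>` has to be buffered until the parse finishes:

```python
# Sketch: a query like "all <item>s, but only if the trailing <summary>
# says ok='yes'" selects early elements based on late content, so a
# streaming evaluator cannot emit (or discard) the first <item> until
# it has seen the whole document.
import io
import xml.etree.ElementTree as ET

doc = b"""<doc>
  <item>first</item>
  <item>second</item>
  <summary ok="yes"/>
</doc>"""

buffered = []            # items we cannot decide on yet
predicate_holds = False

for event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
    if elem.tag == "item":
        buffered.append(elem.text)      # can't emit yet -> buffer
    elif elem.tag == "summary":
        predicate_holds = elem.get("ok") == "yes"

result = buffered if predicate_holds else []
print(result)   # only known after the whole parse
```

In the worst case "buffered" grows to the whole document, which is exactly the streaming failure mode being described.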
I think from the grammar side, XPath made some decisions that make it really hard to implement efficiently in general. About 10 years ago I was looking into binary XML systems and compiling stuff down for embedded systems, and realized that it is really hard to e.g. create efficient transducers (in/out pushdown automata) for XSLT due to the complexity of XPath.
Streaming is defined in the XSLT 3 spec: https://www.w3.org/TR/xslt-30/#streamability. When you want to use streaming, you are confined to a subset of XPath that is "guaranteed streamable", e.g. you can't just freely navigate the tree anymore. There are some special instructions in XSLT such as <xsl:merge> and <xsl:accumulator> that make it easier to collect your results.
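As a loose analogy for what `<xsl:accumulator>` buys you (this is Python, not XSLT — the element names and counts here are invented for illustration): instead of navigating back through the tree, you maintain a running value as the stream goes by, so later nodes can consult earlier state in a single forward pass:

```python
# Accumulator-style single forward pass: keep a running count of
# <error> elements so that each later <ok> element can consult the
# count without re-reading earlier input.
import io
import xml.etree.ElementTree as ET

doc = b"<log><error/><ok/><error/><ok/></log>"

errors_so_far = 0
snapshot_at_each_ok = []

for event, elem in ET.iterparse(io.BytesIO(doc), events=("start",)):
    if elem.tag == "error":
        errors_so_far += 1
    elif elem.tag == "ok":
        snapshot_at_each_ok.append(errors_so_far)

print(snapshot_at_each_ok)  # running count seen at each <ok>
```

This forward-only discipline is essentially what the "guaranteed streamable" subset enforces: state flows with the document, never against it.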
Saxon's paid edition supports it. I've done it a few times, but you have to write your XSLT in a completely different way to make it work.
As I remember, this was still under development at the time. But working with bounded memory on communication buffers was (without remembering all the details) a pain, not because of XSLT itself but mostly because of its interactions with XPath. At the time I was formally looking into hedge grammars and visibly pushdown automata as a formal basis, but it seemed to me that the formal complexity had been pushed unnecessarily beyond what was straightforwardly feasible. As I said, it was about transforming binary (intermediate) representations of XML; the use case was building message middleware/routers for IoT stuff at the time. IMHO the chosen binary XML standards were also mostly the wrong choice for small embedded systems. (Interestingly, MPEG-7 is btw one of the few standards that supports rather nice streaming binary XML. I think it is only used in digital broadcasting, however.)
Would it be possible to transform a large XML document into something on-disk that could be queried like a database by the XPath evaluator?
Given the nature of this processing, I think even an NVMe-based disk storage would be awfully slow. (People often forget, or never realize, that the "gigabytes per second" that NVMe yields is for sequential access. Random access is quite a bit slower; still stomps spinning rust, but by much less. And this is going to be a random access sort of job, so we're in the "several multiples slower than RAM" regime of access.) This sort of thing really wants RAM, and even then, RAM with an eye towards cache coherency and other such performance considerations.
You'd basically be building an index into each node.
There are some fast databases that store prefix trees, which might actually be suitable for such a task (something like infinitydb). Building that database will take a while, though, since it requires parsing the entire document. But I suppose if reading/querying is going to happen many times, it's worth it?
It seems to me one could replace each text node with the offset. Perhaps limit it to longer instances?
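A minimal sketch of that offset idea (names and structure invented for illustration): instead of storing text nodes, record (tag, byte offset, length) for each one during a single streaming pass, so an evaluator can later seek into the original file on demand. This uses expat's `CurrentByteIndex`, which gives the byte position of the current parse event:

```python
# Build a tiny "index into each node": one (tag, offset, length) entry
# per text node, pointing back into the raw document bytes.
import xml.parsers.expat

doc = b"<doc><a>hello</a><b>world</b></doc>"

index = []    # (enclosing tag, byte offset of text, text length)
stack = []    # current element nesting

p = xml.parsers.expat.ParserCreate()

def start(tag, attrs):
    stack.append(tag)

def end(tag):
    stack.pop()

def chars(data):
    # CurrentByteIndex is the first byte of the current event,
    # i.e. where this run of character data starts in the input.
    index.append((stack[-1], p.CurrentByteIndex, len(data)))

p.StartElementHandler = start
p.EndElementHandler = end
p.CharacterDataHandler = chars
p.Parse(doc, True)

# The index lets you recover each text node by seeking, not parsing:
for tag, off, n in index:
    print(tag, doc[off:off + n])
```

Note that expat may split a long text node across several character-data events, so a real index would need to merge adjacent entries; this sketch ignores that.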
Like MarkLogic?