Comment by petcat
8 hours ago
> XML Is a Cheap [...]
> XML is notoriously expensive to properly parse in many languages.
I'm glad this is the top comment. I have extensive experience in enterprise-y Java and XML, and XML is anything but cheap. In fact, doing anything non-trivial with XML was regularly a memory and CPU bottleneck.
That's if you parse it into a DOM and work on that. If you use SAX parsing, the memory footprint gets much better.
But of course, working with SAX parsing is yet another, very different, bag of snakes.
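To make the DOM vs. SAX trade-off concrete, here is a minimal Python sketch (not Java, and not anyone's production code). The sample document and the names `XML_DOC`, `OrderCounter`, `count_orders_sax`, and `count_orders_dom` are made up for illustration; only `xml.sax` and `xml.dom.minidom` are real standard-library modules. The SAX version reacts to events and never builds a tree, while the DOM version materializes the whole document first.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, invented for this sketch.
XML_DOC = (
    b"<orders>"
    b"<order id='1'><item>pen</item></order>"
    b"<order id='2'><item>ink</item></order>"
    b"</orders>"
)


class OrderCounter(xml.sax.ContentHandler):
    """Counts <order> start tags as events arrive; stores no tree."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag, in document order.
        if name == "order":
            self.count += 1


def count_orders_sax(data: bytes) -> int:
    """Streaming style: state lives in the handler, not in a document tree."""
    handler = OrderCounter()
    xml.sax.parseString(data, handler)
    return handler.count


def count_orders_dom(data: bytes) -> int:
    """DOM style: the whole document is parsed into memory, then queried."""
    doc = minidom.parseString(data)
    return len(doc.getElementsByTagName("order"))
```

The bag of snakes shows up as soon as the handler needs context (which element am I inside, what text have I collected so far), because you have to track all of that state by hand.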
I still wish JSON parsing had the same support for stream processing as XML. (I know there are existing solutions for that, but they're much less common than in the XML world.)
In the context of the article, "cheap" means "easy to set up," not "computationally efficient." The article argues that there are situations in which you benefit from sacrificing the latter in favor of the former. You're right that it's annoyingly slow to parse, though, and that does cause issues I'd like to fix.
If you want a parser that actually checks the XML spec and various edge cases, then parsing goes from human-readable config to O(n^2) string handling. The funny part is how often people silently accept partial or broken XML in prod because revisiting schema validation years later is a nightmare. If you want cheap parsing, you end up writing a regex or DOM walker and hoping for the best, which raises the question of why not just use JSON or invent a different DSL to start.
(Properly formatted) XML can be parsed, and streamed, by a visibly-pushdown automaton[1][2].
"Visibly Pushdown Expressions"[3] can simplify parsing with a terse syntax styled after regular expressions, and there's an extension to SQL which can query XML documents using VPAs[4].
JSON can also be parsed and validated with visibly pushdown automata. There's an interesting project[5] which aims to automatically produce a VPA from a JSON-schema to validate documents.
In theory these should be able to outperform parsers based on deterministic pushdown automata (i.e., (LA)LR parsers), but they're less widely used and understood, as they're much newer than the conventional parsing techniques and absent from the popular literature (Dragon Book, EAC, etc.).
[1]: https://madhu.cs.illinois.edu/www07.pdf
[2]: https://www.cis.upenn.edu/~alur/Cav14.pdf
[3]: https://homes.cs.aau.dk/~srba/courses/MCS-07/vpe.pdf
[4]: https://web.cs.ucla.edu/~zaniolo/papers/002_R13.pdf
[5]: https://www.gaetanstaquet.com/ValidatingJSONDocumentsWithLea...
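For readers who haven't met the term: the "visibly" part means the input symbol alone decides the stack action. Opening tags push, closing tags pop, everything else leaves the stack alone. Below is a rough Python sketch of that idea, not an implementation from any of the linked papers. The token format and the name `vpa_accepts` are assumptions made for illustration, and the sketch is deliberately reduced to a single-state machine that only checks that tags are well matched.

```python
from typing import Iterable, List, Tuple

# Hypothetical pre-tokenized input: (kind, name), where kind is one of
# "call" (opening tag), "return" (closing tag), or "internal" (text etc.).
Token = Tuple[str, str]


def vpa_accepts(tokens: Iterable[Token]) -> bool:
    """Accept iff every call is closed by a matching return, in order.

    The stack action depends only on the token kind, which is the
    defining restriction of a visibly pushdown automaton.
    """
    stack: List[str] = []
    for kind, name in tokens:
        if kind == "call":
            stack.append(name)  # always push on a call symbol
        elif kind == "return":
            # always pop on a return symbol; reject on mismatch or underflow
            if not stack or stack.pop() != name:
                return False
        elif kind == "internal":
            continue  # internal symbols never touch the stack
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    return not stack  # leftover calls mean unclosed tags


well_nested = [
    ("call", "a"), ("internal", "text"),
    ("call", "b"), ("return", "b"),
    ("return", "a"),
]
crossed = [("call", "a"), ("call", "b"), ("return", "a"), ("return", "b")]
```

Here `vpa_accepts(well_nested)` is true and `vpa_accepts(crossed)` is false. Because the input is consumed in one pass with memory proportional to nesting depth, the same loop works on a stream, which is the property the comment is pointing at.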
Without looking, I guessed that all your quotes come from academic papers. I was right.
Because real life is nothing like what is taught in CS classes.
Yup. SAP and their glorious IDocs with German acronyms.