Comment by sramsay

5 months ago

I don't think people realize just how important XML is in this space (complex documentary editing, textual criticism, scholarly full-text archives in the humanities). JSON cannot be used for the kinds of tasks to which TEI is put. It's not even an option.

Nothing could compel me to like XSLT. I admire certain elements of its design, but in practice it just seems needlessly verbose. I really love XPath, though.

XML is great for documents.

If your data is essentially a long piece of text, with annotations associated with certain parts of that text, this is where XML shines.

When you try to use XML to represent something like an ecommerce order, financial transaction, instant message and so on, this is where you start to see problems. Trying to shove some extremely convoluted representation of text ranges and their attributes into JSON is just as bad.

A good "rule of thumb" would be "does this document still make sense if all the tags are stripped, and only the text nodes remain?" If yes, choose XML; if not, choose JSON.
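A rough way to try that rule of thumb on a sample, as a sketch only: `stripTags` is a hypothetical helper, the two sample documents are made up for illustration, and a regex is not a real XML parser (it ignores comments, CDATA and entities).

```js
// Hypothetical helper: crude tag stripper for eyeballing the heuristic.
// It only removes anything that looks like <...> and collapses whitespace.
function stripTags(markup) {
  return markup.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

// Made-up samples: an annotated letter and an order record.
const letter =
  '<p>Dear <persName ref="#js">John</persName>, I arrived ' +
  '<date when="1851-03-02">yesterday</date>.</p>';
const order = "<order><id>1042</id><sku>AB-7</sku><qty>3</qty></order>";

console.log(stripTags(letter)); // still reads as a sentence
console.log(stripTags(order));  // the field values run together
```

The letter survives as prose, which suggests markup; the order collapses into an unreadable run of values, which suggests a data format.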

  • XML is honestly the greatest and I'm not sure why it didn't take off. People sometimes ask me, "what impacted humanity the most - electricity? antibiotics? combustion engines?" -- no, no, and no, it was XML. Everything can be expressed in XML, and basically everything can read and write XML. It's like the whole world could read and write the same files. Now imagine those files included programs: that's what XSLT is, a program that is itself an XML file and that performs transformations from XML to XML. Wow - now everything can read and write your programming language! About 90% of it is usually XML documenting your XML-to-XML transforming XML code, the other 9% is boilerplate, and 1% does the lifting. Brilliant. Imagine a more verbose Java, for those of us who find Java terse; it almost feels like assembly to me. XML is like the Tower of Babel that unites all of humanity and JSON is the devil that shattered that dream.

    • Maybe one reason is its verbosity for small everyday tasks, like config files or when representing arrays. If XML allowed empty tags there probably would be no need for JSON.


What actually prevents JSON from being used in these spaces? It seems to me that any XML structure can be represented in JSON. Personally, I've yet to come across an XML document I didn't wish was JSON, but perhaps in spaces I haven't worked with, it exists.

  • > It seems to me that any XML structure can be represented in JSON

    Well it can't: JSON has no processing instructions, no references, no comments, JSON "numbers" are problematic, and JSON arrays can't have attributes, so you're stuck with some kind of additional protocol that maps the two.

    For something that is basically text (like an HTML document) or a list of dictionaries (like RSS) it may not seem obvious what the value of these things are (or even what they mean, if you have little exposure to XML), so I'll try and explain some of that.

    1. Processing instructions are like <?xml?> and <?xml-stylesheet?> -- these let your application embed linear processing instructions that you know are for the implementation, and so you know what your implementation needs to do with the information: If it doesn't need to do anything, you can ignore them easily, because they are (parsewise) distinct.

    2. References (called entities) are created with <!ENTITY x ...> and then you use them as &x; maybe you are familiar with &lt; representing < but this is not mere string replacement: you can work with the pre-parsed entity object (for example, if it's an image), or treat it as a reference (which can make circular objects possible to represent in XML), neither of which is possible in JSON. Entities can be behind an external URI as well.

    3. Comments are for humans. Lots of people put special {"comment":"xxx"} objects in their JSON, so you need to understand that protocol and filter it. They are obvious (like the processing instructions) in XML.

    4. JSON numbers fold into floats of different sizes in different implementations, so you have to avoid them in interchange protocols. This is annoying and bug-prone.

    5. Attributes are the things on xml tags <foo bar="42">...</foo> - Some people map this in JSON as {"bar":"42","children":[...],"tag":"foo"} and others like ["foo",{"bar":"42"},...] but you have to make a decision -- the former may be difficult to parse in a streaming way, but the latter creates additional nesting levels.
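    The two mappings in point 5 can be converted into each other, which is a quick way to see the "additional protocol" involved. This is a minimal sketch: `objectToArray` and `arrayToObject` are hypothetical helper names, and it assumes text children are plain strings.

    ```js
    // Shape A: {"bar":"42","children":[...],"tag":"foo"}  (attributes as sibling keys)
    // Shape B: ["foo",{"bar":"42"},...children]           (tag, attributes, then children)

    // Hypothetical helper: shape A -> shape B.
    function objectToArray(node) {
      if (typeof node === "string") return node; // text child
      const { tag, children = [], ...attrs } = node;
      return [tag, attrs, ...children.map(objectToArray)];
    }

    // Hypothetical helper: shape B -> shape A.
    // Caveat: an attribute literally named "tag" or "children" would collide here.
    function arrayToObject(node) {
      if (typeof node === "string") return node; // text child
      const [tag, attrs, ...children] = node;
      return { ...attrs, tag, children: children.map(arrayToObject) };
    }

    const asArray = ["foo", { bar: "42" }, "hello", ["baz", {}, "x"]];
    const asObject = arrayToObject(asArray);
    console.log(JSON.stringify(asObject));
    console.log(JSON.stringify(objectToArray(asObject)));
    ```

    The name collision noted in the comment is exactly the kind of decision each mapping forces on whoever defines it.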

    None of this is insurmountable: You can obviously encapsulate almost anything in almost anything else, but think about all the extra work you're doing, and how much risk there is in that code working forever!
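    Point 4 is easy to reproduce in JavaScript, where `JSON.parse` turns every number into a 64-bit float. The field names below are made up for illustration; sending large values as strings is one common workaround, not the only one.

    ```js
    // 2^53 + 1 cannot be represented exactly as a 64-bit float.
    const wire = '{"id": 9007199254740993}';
    const parsed = JSON.parse(wire);

    console.log(parsed.id);                    // 9007199254740992 (silently off by one)
    console.log(Number.isSafeInteger(parsed.id)); // false

    // Workaround sketch: ship the value as a string and convert explicitly.
    const safeWire = '{"id": "9007199254740993"}';
    const id = BigInt(JSON.parse(safeWire).id);
    console.log(id.toString());                // 9007199254740993
    ```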

    For me: I process financial/business data mostly in XML, so it is very important I am confident my implementation is correct, because shit happens as the result of that document getting to me. Having the vendor provide a spec any XML software can understand helps us have a machine-readable contract, but I am getting a number of new vendors who want to use JSON, and I will tell you their APIs never work: They will give me openapi and swagger "templates" that just don't validate, and type-coding always requires extra parsing of the strings the JSON parsing comes back with. If there's a pager interface: I have to implement special logic for that (this is built-in to XML). If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's an ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.
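    The date guessing described above tends to end up looking something like this. It is a sketch under stated assumptions: `toDate` is a hypothetical helper, the 1e11 cutoff between seconds and milliseconds is a heuristic, and HTTP-date parsing by `new Date(...)` works in V8/Node but is not guaranteed by the language spec.

    ```js
    // Hypothetical normaliser for the four shapes: unix seconds, unix
    // milliseconds, an ISO 8601 string, and an HTTP date.
    function toDate(value) {
      if (typeof value === "number") {
        // Heuristic (assumption): anything below 1e11 is seconds, else milliseconds.
        return new Date(value < 1e11 ? value * 1000 : value);
      }
      const d = new Date(value); // ISO 8601 is specified; other formats are engine-dependent
      if (Number.isNaN(d.getTime())) throw new Error(`unrecognised date: ${value}`);
      return d;
    }

    const inputs = [
      1445412480,                      // unix seconds
      1445412480000,                   // "1000x off": unix milliseconds
      "2015-10-21T07:28:00Z",          // ISO 8601
      "Wed, 21 Oct 2015 07:28:00 GMT", // HTTP date
    ];
    const normalised = inputs.map((v) => toDate(v).toISOString());
    console.log(normalised);
    ```

    All four inputs are meant to denote the same instant; the point is how much guessing sits between the wire value and a usable date.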

    So I am always finding JSON that I wish were XML, because (in my use-cases) XML is just plain better than JSON, but if you do a lot in languages with poor XML support (like JavaScript, Python, etc) all of these things will seem hard enough you might think JSON+xyz is a good alternative (especially if you like JSON), so I understand the need for stuff like "xee" to make XML more accessible so that people stop doing so much with JSON. I don't know Rust well enough to know if xee does that, but I understand fully the need.

    • ><!ENTITY x ...> and then you use them as &x; maybe you are familiar with &lt; representing <

      Okay. This is syntactically painful, APL or J tier. C++ just uses "&" to indicate a reference. That's a lot of people's issue with XML, you get the syntactic pain of APL with the verbosity pain of Java.

      > I have to implement special logic for that (this is built-in to XML). If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's an ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.

      Special logic is built into every real-world programming scenario ever. It just means the programmer had to diverge from the ideal to make something work. Unpleasant, but vanilla and common. I don't see how XML magically solved the date issue forever. For example, I could just toss in <date>UNIXtime</date> or <date time="microseconds since 1997">324234234</date> or <datecontainer><measurement units="femtoseconds since 1776"><value>3234234234234</value></measurement></datecontainer>. The argument seems to be "ah yes, but if everyone uses this XML date feature it's solved!" but not so. It's a special case of "if everyone did the same thing, it would be solved". But nobody does the same thing.


    • I think I can see something of where you're coming from. But a question:

      You complain about dates in JSON (really a specific case of parsing text in JSON):

      > If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's an ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.

      Sure, but doesn't XML have the exact same problem, since everything in it is just text?


  • Have you ever written Markdown? Markdown is typically mostly human-readable text, interspersed with occasional formatting instructions. That's what XML is good for, except that it's more verbose but also considerably more flexible, more precise, and more powerful. Sure, you can losslessly translate any structural format into almost any other structural format, but that doesn't mean that working with the latter format will be as convenient or as efficient as working with the former.

    XML can really shine in the markup role. It got such a bad rap because people used it as a pure data format, something it isn't very suited for.

  • <p>How would you represent <b><i>mixed content</i></b> in JSON?</p>

    • `{ type: "p", children: [{type: "text", text: "How would you represent "}, {type: "b", children: [{type: "i", children: [{type: "text", text: "mixed content"}]}]}, {type: "text", text: " in JSON?"}]}`

      or:

      `{paragraphs: [{spans: [{ text: "How would you represent "}, {bold: true, italic: true, text: "mixed content"}, {text: " in JSON?"}]}]}`
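      Both shapes can be walked back to the original sentence, which is a reasonable sanity check that neither loses text. A minimal sketch: `treeText` and `flatText` are hypothetical helpers, and the literals are the two representations written out as JavaScript objects.

      ```js
      // Nested form: elements contain children, text lives in "text" nodes.
      const tree = {
        type: "p",
        children: [
          { type: "text", text: "How would you represent " },
          { type: "b", children: [{ type: "i", children: [{ type: "text", text: "mixed content" }] }] },
          { type: "text", text: " in JSON?" },
        ],
      };

      // Flat form: a list of spans, formatting carried as boolean flags.
      const flat = {
        paragraphs: [{
          spans: [
            { text: "How would you represent " },
            { bold: true, italic: true, text: "mixed content" },
            { text: " in JSON?" },
          ],
        }],
      };

      // Hypothetical helper: depth-first concatenation of text nodes.
      function treeText(node) {
        return node.type === "text" ? node.text : node.children.map(treeText).join("");
      }

      // Hypothetical helper: join span texts, one line per paragraph.
      function flatText(doc) {
        return doc.paragraphs.map((p) => p.spans.map((s) => s.text).join("")).join("\n");
      }

      console.log(treeText(tree));
      console.log(flatText(flat));
      ```

      The flat form is shorter, but it cannot express which of bold and italic is the outer element, which the nested form (and the XML) does.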


  • In addition to all the things listed above, JSON has no practical advantage. JSON offers no compelling feature that would make anyone switch. What would be gained?

>JSON cannot be used for the kinds of tasks to which TEI is put. It's not even an option.

```js
import * as fastXmlParser from 'fast-xml-parser';

const xmlParser = new fastXmlParser.XMLParser({ ignoreAttributes: false });
```

Validate input as required with jschema.