Comment by Finnucane

5 months ago

I've worked on archive projects with complex TEI XML files (which is why, when people say XML is bad and it should be all JSON or whatever, I just LOL), and fortunately my employer will pay for me to have an editor (Oxygen) that includes the enterprise version of Saxon and other goodies. An open-source XML processing engine that isn't decades out of date would be a big deal in the digital humanities world.

I don't think people realize just how important XML is in this space (complex documentary editing, textual criticism, scholarly full-text archives in the humanities). JSON cannot be used for the kinds of tasks to which TEI is put. It's not even an option.

Nothing could compel me to like XSLT. I admire certain elements of its design, but in practice it just seems needlessly verbose. I really love XPath, though.
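To show the contrast, here is a minimal sketch using the browser's built-in XPath 1.0 support (document.evaluate); the TEI snippet is a made-up example:

```js
// One readable XPath expression does what a full XSLT template would.
const doc = new DOMParser().parseFromString(
  '<TEI><text><body><p><persName>Ada</persName> wrote.</p></body></text></TEI>',
  'application/xml'
);
const result = doc.evaluate('//persName', doc, null, XPathResult.STRING_TYPE, null);
console.log(result.stringValue); // "Ada"
```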

  • XML is great for documents.

    If your data is essentially a long piece of text, with annotations associated with certain parts of that text, this is where XML shines.

    When you try to use XML to represent something like an e-commerce order, a financial transaction, or an instant message, that's where you start to see problems. Trying to shove some extremely convoluted representation of text ranges and their attributes into JSON is just as bad.

    A good "rule of thumb" would be "does this document still make sense if all the tags are stripped, and only the text nodes remain?" If yes, choose XML, if not, choose JSON.

    • XML is honestly the greatest and I'm not sure why it didn't take off. People sometimes ask me, "what impacted humanity the most: electricity? antibiotics? combustion engines?" No, no, and no: it was XML. Everything can be expressed in XML, and basically everything can read and write XML. It's as if the whole world could read and write the same files. Now imagine if those files included programs. That's what XSLT is: a program that is itself an XML file and performs transformations from XML to XML. Wow, now everything can read and write your programming language! About 90% of it is usually spent using XML to document your XML-to-XML-transforming XML code, another 9% is boilerplate, and the remaining 1% does the lifting. Brilliant. Imagine a more verbose Java, for those of us who find Java too terse; to me it almost feels like assembly. XML is the tower of Babel that unites all of humanity, and JSON is the devil that shattered that dream.
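      For anyone who hasn't seen the punchline in the wild, a sketch: the stylesheet below is the classic XSLT identity transform, applied with the browser's built-in XSLTProcessor (XSLT 1.0 only); the input document is a made-up example.

      ```js
      // The identity transform: several lines of XML to say "copy everything".
      const stylesheet = new DOMParser().parseFromString(
        `<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
           <xsl:template match="@*|node()">
             <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
           </xsl:template>
         </xsl:stylesheet>`,
        'application/xml'
      );
      const proc = new XSLTProcessor();
      proc.importStylesheet(stylesheet);
      const copy = proc.transformToDocument(
        new DOMParser().parseFromString('<a><b/></a>', 'application/xml')
      );
      ```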

      3 replies →

  • What actually prevents JSON from being used in these spaces? It seems to me that any XML structure can be represented in JSON. Personally, I've yet to come across an XML document I didn't wish was JSON, but perhaps one exists in spaces I haven't worked in.

    • > It seems to me that any XML structure can be represented in JSON

      Well, it can't: JSON has no processing instructions, no references, and no comments; JSON "numbers" are problematic; and JSON arrays can't have attributes. So you're stuck with some kind of additional protocol that maps between the two.

      For something that is basically text (like an HTML document) or a list of dictionaries (like RSS), it may not be obvious what the value of these things is (or even what they mean, if you have little exposure to XML), so I'll try to explain some of that.

      1. Processing instructions are things like <?xml?> and <?xml-stylesheet?>. They let a document embed instructions that you know are addressed to the implementation, so you know what your implementation needs to do with them. If it doesn't need to do anything, you can ignore them easily, because they are (parse-wise) distinct.

      2. References (called entities) are declared with <!ENTITY x ...> and then used as &x;. Maybe you are familiar with &lt; representing <, but this is not mere string replacement: you can work with the pre-parsed entity object (for example, if it's an image), or treat it as a reference (which makes circular structures possible to represent in XML); neither is possible in JSON. Entities can also live behind an external URI.

      3. Comments are for humans. Lots of people put special {"comment":"xxx"} objects in their JSON, which means you need to understand that ad-hoc protocol and filter it out. In XML, comments (like processing instructions) are obvious.

      4. JSON numbers fold into floats of different sizes in different implementations, so you have to avoid them in interchange protocols. This is annoying and bug-prone.

      5. Attributes are the things on XML tags: <foo bar="42">...</foo>. Some people map this in JSON as {"bar":"42","children":[...],"tag":"foo"} and others as ["foo",{"bar":"42"},...], but you have to make a decision; the former may be difficult to parse in a streaming way, while the latter creates additional nesting levels (see the sketch after this list).
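      A quick sketch of points 4 and 5 in JavaScript; the two attribute mappings below are just the ad-hoc conventions described above, not any standard:

      ```js
      // Point 4: JSON numbers fold into IEEE-754 doubles in JavaScript.
      JSON.parse('{"id": 9007199254740993}').id; // => 9007199254740992 (2^53 + 1 silently rounds)

      // Point 5: two incompatible mappings of <foo bar="42">...</foo>.
      const objectStyle = { tag: 'foo', bar: '42', children: [] }; // hard to stream: keys arrive in any order
      const arrayStyle = ['foo', { bar: '42' } /* , ...children */]; // streams in order, but nests deeper
      ```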

      None of this is insurmountable: you can obviously encapsulate almost anything in almost anything else, but think about all the extra work you're doing, and how much risk there is in keeping that code working forever!

      For me: I process financial/business data mostly in XML, so it is very important that I am confident my implementation is correct, because shit happens as a result of that document getting to me. Having the vendor provide a spec that any XML software can understand gives us a machine-readable contract. But I am getting a number of new vendors who want to use JSON, and I will tell you their APIs never work: they give me OpenAPI and Swagger "templates" that just don't validate, and type-coding always requires extra parsing of the strings the JSON parser hands back. If there's a paging interface, I have to implement special logic for it (this is built in with XML). If they implement dates, sometimes it's unix time, sometimes it's 1000x off from that, sometimes it's an ISO8601-inspired string, and sometimes I just get an HTTP date. And so on.
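      To illustrate the date problem, here is the kind of guesswork a consumer ends up writing. A rough sketch: the helper name and the heuristics are mine, not anything these vendors document.

      ```js
      // Hypothetical normalizer for the zoo of date formats described above.
      function parseVendorDate(value) {
        if (typeof value === 'number') {
          // Guess: a unix-seconds timestamp stays well below 1e11, so anything
          // larger is assumed to be unix milliseconds (the "1000x off" case).
          return new Date(value > 1e11 ? value : value * 1000);
        }
        // Date.parse copes (loosely) with ISO8601-ish strings and HTTP dates.
        return new Date(value);
      }
      ```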

      So I am always finding JSON that I wish were XML, because (in my use cases) XML is just plain better than JSON. But if you do a lot of work in languages with poor XML support (like JavaScript or Python), all of these things will seem hard enough that you might think JSON+xyz is a good alternative (especially if you already like JSON). So I understand the need for stuff like "xee" to make XML more accessible, so that people stop doing so much with JSON. I don't know Rust well enough to say whether xee delivers that, but I fully understand the need.

      5 replies →

    • Have you ever written Markdown? Markdown is typically mostly human-readable text, interspersed with occasional formatting instructions. That's what XML is good for, except that it's more verbose but also considerably more flexible, more precise, and more powerful. Sure, you can losslessly translate any structural format into almost any other structural format, but that doesn't mean that working with the latter format will be as convenient or as efficient as working with the former.

      XML can really shine in the markup role. It got such a bad rap because people used it as a pure data format, something it isn't very suited for.
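      A small illustration of that difference; the TEI-style attributes are just an example of the extra precision markup can carry:

      ```js
      // Markdown records how text should look; XML markup can say what it is.
      const markdown = 'The letter is dated *Tuesday*.';
      const xml =
        'The letter is dated <date when="1871-06-13" cert="medium">Tuesday</date>.';
      ```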

    • In addition to all the things listed above, JSON has no practical advantage here. JSON offers no compelling feature that would make anyone switch. What would be gained?

      1 reply →

  • >JSON cannot be used for the kinds of tasks to which TEI is put. It's not even an option.

    ```js
    import * as fastXmlParser from 'fast-xml-parser';

    // Parse XML into plain JS objects, keeping attributes.
    const xmlParser = new fastXmlParser.XMLParser({ ignoreAttributes: false });
    ```

    Validate input as required with jschema.

My hope is that we can get a little collective together that is willing to invest in this tooling, either with time or money. I didn't have much hope before, but after seeing the positive response today I have more than I did.

Oxygen was such a clunky application back when I used it for DH, but very powerful and the best tool in the game. I would love to see a modern tool that doesn't get in the way for all those poorly paid, overworked DH research assistants, caffeinated in the dead of night, banging out the tedious, often very manual, TEI-XML encoding work...