Comment by velcrovan

10 months ago

You should try using a LISP like Racket for XML. Because XML can be expressed directly as S-expressions, XML and LISP go together like peanut butter and jelly.

    <greeting attr="val" href="#">Hello <thing>world</thing></greeting>

    (greeting ((attr "val") (href "#")) "Hello " (thing "world"))
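For readers outside the Lisp world, here is a rough Python sketch of the same mapping, assuming nested lists stand in for S-expressions. It uses the standard library's `xml.etree.ElementTree`; `to_sexp` is a hypothetical helper name, not part of any library, and it ignores comments, processing instructions, and namespaces.

```python
import xml.etree.ElementTree as ET


def to_sexp(elem):
    """Convert an element to [tag, [[attr, value], ...], child, ...].

    The attribute list is omitted when the element has no attributes,
    mirroring (thing "world") in the example above.
    """
    node = [elem.tag]
    if elem.attrib:
        node.append([[name, value] for name, value in elem.attrib.items()])
    if elem.text:
        node.append(elem.text)
    for child in elem:
        node.append(to_sexp(child))
        # Text that follows a child element belongs to the parent's content.
        if child.tail:
            node.append(child.tail)
    return node


doc = ET.fromstring(
    '<greeting attr="val" href="#">Hello <thing>world</thing></greeting>'
)
sexp = to_sexp(doc)
print(sexp)
```

The nested list has the same shape as the S-expression: tag first, then an optional list of attribute pairs, then the children in document order.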

koito17 10 months ago

In my experience, at least with Clojure, it's much more convenient to parse XML into a map-like structure. With your example, the data structure would look like this:

  {:tag     :greeting
   :attrs   {:href "#" :attr "val"}
   :content ["Hello " {:tag :thing :content ["world"]}]}

Some people use namespaced keywords (e.g. :xml/tag) to help disambiguate keys in the map. This kind of data structure tends to be more convenient than dealing with plain sexps or so-called "Hiccup syntax", i.e.:

  [:greeting {:href "#" :attr "val"} "Hello " [:thing "world"]]

The above syntax is convenient to write, but it's tedious to manipulate. For instance, one needs to dispatch on types to determine whether an element at some index is an attribute map or a child. By using the former data structure, one simply looks up the :attrs or :content key. Additionally, the map structure is easier to depth-first search; it's a one-liner with the tree-seq function.
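To make the comparison concrete outside Clojure, here is a minimal Python sketch, assuming dicts and lists stand in for Clojure maps and vectors. `tree_seq` is only a rough analogue of Clojure's tree-seq, and `hiccup_parts` is a hypothetical helper showing the type dispatch that the Hiccup form requires.

```python
def tree_seq(node):
    """Depth-first, pre-order walk over a map-style tree."""
    yield node
    if isinstance(node, dict):
        for child in node.get("content", []):
            yield from tree_seq(child)


def hiccup_parts(node):
    """Split a Hiccup-style list into (tag, attrs, children).

    The second item has to be inspected to decide whether it is an
    attribute map or the first child.
    """
    tag, rest = node[0], node[1:]
    if rest and isinstance(rest[0], dict):
        return tag, rest[0], rest[1:]
    return tag, {}, rest


greeting_map = {
    "tag": "greeting",
    "attrs": {"href": "#", "attr": "val"},
    "content": ["Hello ", {"tag": "thing", "content": ["world"]}],
}
greeting_hiccup = ["greeting", {"href": "#", "attr": "val"}, "Hello ", ["thing", "world"]]

# Map form: attributes are a plain key lookup, and the walk is one expression.
tags = [n["tag"] for n in tree_seq(greeting_map) if isinstance(n, dict)]

# Hiccup form: every access goes through the type check in hiccup_parts.
tag, attrs, children = hiccup_parts(greeting_hiccup)
```

With the map form, `greeting_map["attrs"]` works on any element without a helper; with the Hiccup form, each consumer needs something like `hiccup_parts` first.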

I've written a rudimentary EPUB parser in Clojure and found it easier to work with zippers than any other data structure to e.g. look for <rootfile> elements with a <container> ancestor.
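The ancestor query can be sketched in Python as follows. This is not a zipper, just a simpler walk that carries the chain of ancestors along, assuming the map-style nodes from above. The function names are hypothetical, the `container` data is a simplified illustration (namespaces left out), and the `full-path` value is made up.

```python
def walk_with_ancestors(node, ancestors=()):
    """Yield (node, ancestors) for every element in the tree.

    `ancestors` is a tuple running from the root down to the node's parent.
    """
    yield node, ancestors
    for child in node.get("content", []):
        if isinstance(child, dict):  # skip text children
            yield from walk_with_ancestors(child, ancestors + (node,))


def find_with_ancestor(root, tag, ancestor_tag):
    """Return elements named `tag` that have an ancestor named `ancestor_tag`."""
    return [
        node
        for node, ancestors in walk_with_ancestors(root)
        if node["tag"] == tag
        and any(a["tag"] == ancestor_tag for a in ancestors)
    ]


# Simplified, illustrative stand-in for an EPUB container document.
container = {
    "tag": "container",
    "content": [
        {
            "tag": "rootfiles",
            "content": [
                {
                    "tag": "rootfile",
                    "attrs": {"full-path": "OEBPS/content.opf"},
                    "content": [],
                },
            ],
        },
    ],
}

matches = find_with_ancestor(container, "rootfile", "container")
```

A real zipper additionally lets you move up, sideways, and edit in place from the current location; the walk above only covers the read-only "does it have this ancestor" case.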

Zippers are available in most programming languages, thankfully, so this advantage is not really unique to Clojure (or another Lisp). However, I will agree that something like sexps (or Hiccup) is more convenient than e.g. JSX, since you are dealing with the native syntax of the language rather than introducing a compilation step and non-standard syntax.

velcrovan 10 months ago

I have not looked into the use of zippers for this purpose, but I will do so!
Racket has helper libraries like TxExpr (https://docs.racket-lang.org/txexpr/index.html) that make it pretty easy to manipulate S-expressions of this kind.

zoogeny 10 months ago

This looks like it loses the distinction between attributes and nested tags?

As in, I don't see a difference between `(attr "val")` which expresses an attribute key/value pair and `(thing "world")` which expresses a tag/content relationship. Even if I thought the rule might be "if the first element of the list is a list itself then it should be interpreted as a set of attribute key value pairs" then it would still be ambiguous with:

    (foo (bar "baz") "content")

which could serialize to either:

    <foo bar="baz">content</foo>

or:

    <foo><bar>baz</bar>content</foo>

In fact, this ambiguity between attributes and children has always been one of the head scratching things for me about XML. Well, the thing I've always disliked the most is namespaces but that is another matter.

shawn_w 10 months ago
There's no ambiguity. The first element is a symbol that's the name of a tag. If the second element is a list of two-element lists, each a symbol followed by a string, it's the attributes. If it's one of the other recognized types, it's part of the contents of the tag.
See a grammar for the representation at https://docs.racket-lang.org/xml/index.html#%28def._%28%28li...
Most Scheme tools for working with XML use a different layout where a list starting with the symbol @ indicates attributes. See https://en.wikipedia.org/wiki/SXML for it.
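As a rough illustration of that rule in Python (not Racket's actual implementation), assume nested lists stand in for S-expressions and plain strings stand in for both symbols and text. `is_attr_list` and `to_xml` are hypothetical names; `escape` and `quoteattr` come from the standard library's `xml.sax.saxutils`.

```python
from xml.sax.saxutils import escape, quoteattr


def is_attr_list(x):
    """True if x is a list whose items are all [name, value] string pairs."""
    return isinstance(x, list) and all(
        isinstance(pair, list)
        and len(pair) == 2
        and all(isinstance(s, str) for s in pair)
        for pair in x
    )


def to_xml(node):
    """Serialize [tag, optional-attr-list, child, ...] to an XML string."""
    if isinstance(node, str):
        return escape(node)
    tag, rest = node[0], node[1:]
    attrs = []
    # Only the second position is ever checked for an attribute list.
    if rest and is_attr_list(rest[0]):
        attrs, rest = rest[0], rest[1:]
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs)
    body = "".join(to_xml(child) for child in rest)
    return f"<{tag}{attr_text}>{body}</{tag}>"


nested = to_xml(["foo", ["bar", "baz"], "content"])
with_attr = to_xml(["foo", [["bar", "baz"]], "content"])
```

Here `["bar", "baz"]` starts with a string, so it is a child element, while `[["bar", "baz"]]` is a list of pairs, so it is read as the attribute list.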
- zoogeny 10 months ago
  
  I see, so my example should be:
  (foo (bar "baz") "content")
  vs
  (foo ((bar "baz")) "content")
  Where the first one would be the nested tags and the second one would be a single `bar="baz"` attribute.
  I would prefer the differentiation to be more explicit than the position and/or structure of the list, so the @ symbol modifier for the attribute list in other tools makes sense.
  The sibling comment's map with an :attrs key feels even better. I don't work in languages with pattern matching or that kind of thing very often, but if I wanted to know whether a particular element had one or more attributes, being able to check a dictionary key just feels like a nicer anchor point to match against.
immibis 10 months ago
> In fact, this ambiguity between attributes and children has always been one of the head scratching things for me about XML. Well, the thing I've always disliked the most is namespaces but that is another matter.
Just remember that it's a markup language, and then it's not head-scratching at all: the text is the text being marked up, and the attribute values are attributes of the markup - things like colour and font.
When it was co-opted to store structured data, those people didn't obey this rule (which would make everything attributes).
Namespaces had a very cool use in XHTML: you could just embed an SVG or MathML directly in your HTML and the browser would render it. This feature was copied into HTML5.
- zoogeny 10 months ago
  
  When you say "those people", you mean people like me who (used to) have to navigate how to model structured data using XML. I think the attribute vs. child distinction makes sense in a very flat hierarchy where you are marking up text but quickly devolves into ambiguity for many other use cases.
  I mean, if I'm modeling a <Person> node in some structured format, making a decision about "what is the attribute of the person node" vs "what is a property of the specific Person" isn't an easy call to make in all cases. And then there are cases where an attribute itself ought to have some kind of hierarchy. Even the text example works here: I have a set of font properties and it would make sense to maybe have:
  <font> <color>...</color> <family>...</family> </font>
  Rather than a series of `fontFamily`, `fontSize`, etc. attributes. This is especially true when those attributes are complex objects that end up nested several levels deep. You end up forced to turn things that ought to be attributes into children because you want to model the nested structure of the attributes themselves. Then you end up with some kind of wrapper structure where you might have a section for metadata and a section for the real content.
  I just don't think the distinction works well for an extensible markup language where the nesting of elements is more or less the entire point.
  It is much easier to write out though, which is why you often see `<Element content=" ... " />` patterns all over the place.
  

froh 10 months ago

a lisp... like dsssl ? ;-)