Comment by tannhaeuser

4 days ago

> copy and paste content and not having layout page is annoying at times

HTML was envisioned as an SGML application/vocabulary, and SGML has those power features, such as type-checked shared fragments/text macros (entities, possibly with parameters), safe third-party content transclusion, markup stream processing and filtering for generating a table of contents for page or site navigation, content screening for removal/rejection of undesired script in user content, expansion of custom Wiki syntax such as markdown into HTML, producing "views" for RSS or search result pages in pipelines, etc. See [1] for a basic tutorial.

[1]: https://sgmljs.net/docs/producing-html-tutorial/producing-ht...

I didn't expect this to be serious and am surprised that the tutorial actually delivers. Way back when I was learning HTML, it was said that it was built with SGML, but this relation remained a total mystery to me.

  • > I didn't expect this to be serious and am surprised that the tutorial actually delivers.

    Same here. I believed that SGML was like the Lisp of markup languages, except that it went completely extinct. Good to see it's still usable; I feel like I want to try it out now (instead of making a third generation of my static site generator from scratch).

I’ve become quite a fan of writing in SGML personally, because much of what you note is spot-on. Some of the points seem a bit of a stretch though.

Any type-checking inside of SGML is more akin to unused-variable checking. When you say that macros/entities may contain parameters, I think you are referring to recursive entity expansion, which does let you parameterize macros (but only once, and not dynamically within the text). For instance, you can set a `&currentYear` entity and refer to that in `copywrite "&currentYear/&currentDay"`, but that can only happen in the DTD at the start of the document. It’s not the case that you could, for instance, create an entity to generate a GitHub repo link and use it like `&repoName = "diff-match-patch"; &githubLink`. This feature was used in limited form to conditionally include sections of markup since SGML contains an `IGNORE` “marked section”.

   <!ENTITY % private-render "IGNORE">
   ...
   <![%private-render[
   <side-note>
   I’m on the fence about including this bit.
   It’s not up to the editorial standards.
   </side-note>
   ]]>
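
To make the mechanics concrete, here is a rough Python sketch of how a processor might resolve a marked section like the one above. It is a toy under loud assumptions: the name `resolve_marked_sections` and the regular expression are mine, it ignores nesting and the other status keywords (`CDATA`, `RCDATA`, `TEMP`), and a real SGML parser does far more than this.

```python
import re

# Toy pattern for <![%name[ ... ]]> (optional ";" after the entity name).
# Assumption: no nested marked sections and no "]]>" inside the content.
MARKED_SECTION = re.compile(r"<!\[\s*%([\w.-]+);?\s*\[(.*?)\]\]>", re.DOTALL)


def resolve_marked_sections(text: str, parameter_entities: dict[str, str]) -> str:
    """Keep or drop each marked section based on its parameter entity's value."""

    def replace(match: re.Match) -> str:
        status = parameter_entities[match.group(1)].strip().upper()
        # Only INCLUDE and IGNORE are modelled in this sketch:
        # INCLUDE keeps the content, everything else is dropped.
        return match.group(2) if status == "INCLUDE" else ""

    return MARKED_SECTION.sub(replace, text)


doc = "<p>Public.</p><![%private-render[<side-note>Draft.</side-note>]]>"
print(resolve_marked_sections(doc, {"private-render": "IGNORE"}))
print(resolve_marked_sections(doc, {"private-render": "INCLUDE"}))
```

The point is that the switch lives in one declaration: flipping `private-render` from `IGNORE` to `INCLUDE` changes what gets rendered without touching the body text.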

SGML also fights hard against stream processing, even more so than XML (and XML pundits regret not deprecating certain SGML features like entities which obstruct stream processing). Because of things like this, it’s not possible to parse a document without having the entire thing from the start, and because of things like tag omission (which is part of its syntax “MINIMIZATION” features), it’s often not possible to parse a document without having _everything up to the end_.
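
A toy Python sketch of what I mean by tag omission getting in the way. The content model `ALLOWED_CHILDREN` and the function `infer_end_tags` are made up for illustration and only handle start tags; this is not how any particular SGML parser works. It does show that the end of an element is only known once a later token (or the end of input) arrives.

```python
# Hypothetical content model: which children each element may contain directly.
ALLOWED_CHILDREN = {
    "beer-list": {"beer-flight"},
    "beer-flight": {"beer"},
    "beer": set(),  # text only, no child elements
}


def infer_end_tags(start_tags, allowed=ALLOWED_CHILDREN):
    """Return open/close events for a sequence of start tags with omitted end tags."""
    events = []
    open_elements = []
    for name in start_tags:
        # Close open elements until the innermost one may contain the new element.
        while open_elements and name not in allowed[open_elements[-1]]:
            events.append(("close", open_elements.pop()))
        open_elements.append(name)
        events.append(("open", name))
    # End of input: only now do we learn where the remaining elements end.
    while open_elements:
        events.append(("close", open_elements.pop()))
    return events


for event in infer_end_tags(["beer-list", "beer-flight", "beer", "beer"]):
    print(event)
```

Here the first `beer` is closed only when the second `beer` start tag shows up, and `beer-flight` and `beer-list` are closed only at the end of the input. None of that can be decided without the content model from the DTD.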

Would love to hear what you are referring to with “safe” third-party transclusion and also what features are available for removal or rejection of undesired script in user content.

Apart from these I find it a pleasure to use because SGML makes it easy for _humans_ to write structured content (contrast with XML which makes it easy for software to parse). SGML is incredibly hard to parse because in order to accommodate human factors _and actually get people to write structured content_ it leans heavily on computers and software doing the hard work of parsing.

It’s missing some nice features such as namespacing. That is, it’s not possible to have two elements of the same name in the same document with different attributes, content, or meanings. If you want to have a flight record and also a list of beers in a flight, they have to be given different names; otherwise the document will fail to parse.

   <flight-list>
   <flight-record><flight-meta pnr=XYZ123 AAL number=123>
   </flight-list>

   <beer-list>
   <beer-flight>
   <beer Pilsner amount=3oz>Ultra Pils 2023
   <beer IPA>Dual IPA
   <beer Porter>Chocolate milk stout
   </beer-list>

DSSSL was supposed to handle the transforms into RSS, page views, and other styles or visualizations. With XML arose XSL/XSLT, which seemed to gain much more traction than DSSSL ever did. My impression is that declarative transforms are best suited for simpler transforms, particularly those without complicated processing or rearranging of content. Since `osgmls` and the other few SGML parsers are happy to produce an equivalent XML document for the SGML input, it’s easy to transform an SGML document using XSL, and I do this in combination with a `Makefile` to create my own HTML pages (fair warning: HTML _is not XML_ and there are pitfalls in attempting to produce HTML from an XML tool like XSL).

For more complicated work I make quick transformers with WordPress’ HTML API to process the XML output (I know, XML also isn’t HTML, but it parses reliably for me since I don’t produce anything that an HTML parser couldn’t parse). Having an imperative-style processor feels more natural to me, and one written in a programming language that lets me use normal programming conveniences. I think getting the transformer right was never fully realized with the declarative languages, which are similar to Angular and other systems with complicated DSLs inside string attribute values.

I’d love to see the web pick up where SGML left off and get rid of some of the legacy concessions (SGML was written before UTF-8 and its flexibility with input encodings shows it — not in a good way either) as well as adopt some modern enhancements. I wrote about some of this on my personal blog, sorry for the plug.

https://fluffyandflakey.blog/2024/10/11/ugml-a-proposal-to-u...

Edit: formatting

  • Nice to meet a fellow SGML fan!

    > When you say that macros/entities may contain parameters, I think you are referring to recursive entity expansion,

    No, I'm referring to SGML data attributes (attributes declared on notations having concrete values defined on entities of the respective notation); cf. [1]. In sgmljs.net SGML, these can be used for SGML templating, which is a way of using data entities declared as having the SGML notation (i.e. stand-alone SGML files or streams) to replace elements in documents referencing those entities. Unlike general entities, this type of entity expansion is bound to an element name and is informed of the expected content model and other contextual type info at the replacement site, hence is type-safe. Data attributes supplied at the expansion site appear as "system-specific entities" in the processing context of the template entity. See [2] for details and examples.

    Understanding and appreciating the construction of templating as a parametric macro expansion mechanism without additional syntax may require intimate knowledge of lesser known SGML features such as LPDs and data entities, and also some HyTime concepts.

    > create an entity to generate a GitHub repo link

    Templating can turn text data from a calling document into an entity in the called template sub-processing context so might help with your use case, and with the limitation to have to declare things in DTDs upfront in general.

    > it’s not possible to parse a document without having the entire thing from the start, and because of things like tag omission (which is part of its syntax “MINIMIZATION” features), it’s often not possible to parse a document without having _everything up to the end_.

    Why do you think so and why should this be required by tag inference specifically? In sgmljs.net SGML, for external general entities (unlike external parameter entities which are expanded at the point of declaration rather than usage), at no point does text data have to be materialised in its entirety. The parser front-end just switches input events from another external source during entity expansion and switches back afterwards, maintaining a stack of open entities.

    Regarding namespaces, one of their creators (SGML demigod James Clark himself) considers those a failure:

    > the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide (cf. [3]).

    In sgmljs.net SGML, you can handle XML namespace mappings using the special processing instructions defined by ISO/IEC 19757-9:2008. In effect, elements and attributes having names "with colons" are remapped to names with canonical namespace parts (SGML names can allow colons as part of names), which seems like the sane way to deal with "namespaces".

    I haven't checked your site, but most certainly will! Let's keep in touch; you might also be interested in sgmljs.net SGML and the SGML DTD for modern HTML at [4], to be updated for WHATWG HTML review draft January 2025 when/if it's published.

    Edit:

    > Would love to hear what you are referring to with “safe” third-party transclusion and also what features are available for removal or rejection of undesired script in user content.

    In short, I was mainly referring to DTD techniques (content models, attribute defaults) here.

    [1]: https://sgmljs.net/docs/sgmlrefman.html#data-entities

    [2]: https://sgmljs.net/docs/templating.html

    [3]: https://blog.jclark.com/2010/01/xml-namespaces.html

    [4]: https://sgmljs.net/docs/html5.html