Comment by tannhaeuser

6 months ago

Nice to meet a fellow SGML fan!

> When you say that macros/entities may contain parameters, I think you are referring to recursive entity expansion,

No, I'm referring to SGML data attributes (attributes declared on notations having concrete values defined on entities of the respective notation); cf. [1]. In sgmljs.net SGML, these can be used for SGML templating which is a way of using data entities declared as having the SGML notation (ie. stand-alone SGML files or streams) to replace elements in documents referencing those entities. Unlike general entities, this type of entity expansion is bound to an element name and is informed of the expected content model and other contextual type info at the replacement site, hence is type-safe. Data attributes supplied at the expansion site appear as "system-specific entities" in the processing context of the template entity. See [2] for details and examples.

Understanding and appreciating the construction of templating as a parametric macro expansion mechanism without additional syntax may require intimate knowledge of lesser known SGML features such as LPDs and data entities, and also some HyTime concepts.

> create an entity to generate a Github repo link

Templating can turn text data from a calling document into an entity in the called template sub-processing context so might help with your use case, and with the limitation to have to declare things in DTDs upfront in general.

> it’s not possible to parse a document without having the entire thing from the start, and because of things like tag omission (which is part of its syntax “MINIMIZATION” features), it’s often not possible to parse a document without having _everything up to the end_.

Why do you think so and why should this be required by tag inference specifically? In sgmljs.net SGML, for external general entities (unlike external parameter entities which are expanded at the point of declaration rather than usage), at no point does text data have to be materialised in its entirety. The parser front-end just switches input events from another external source during entity expansion and switches back afterwards, maintaining a stack of open entities.

Regarding namespaces, one of their creators (SGML demi-good James Clark himself) considers those a failure:

> the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide (cf. [3]).

In sgmljs.net SGML, you can handle XML namespace mappings using the special processing instructions defined by ISO/IEC 19757-9:2008. In effect, element and attributes having names "with colons" are remapped to names with canonical namespace parts (SGML names can allow colons as part of names), which seems like the sane way to deal with "namespaces".

I haven't checked your site, but most certainly will! Let's keep in touch; you might also be interested in sgmljs.net SGML and the SGML DTD for modern HTML at [4], to be updated for WHATWG HTML review draft January 2025 when/if it's published.

Edit:

> Would love to hear what you are referring to with “safe” third-party transclusion and also what features are available for removal or rejection of undesired script in user content.

In short, I was mainly referring to DTD techniques (content models, attribute defaults) here.

[1]: https://sgmljs.net/docs/sgmlrefman.html#data-entities

[2]: https://sgmljs.net/docs/templating.html

[3]: https://blog.jclark.com/2010/01/xml-namespaces.html

[4]: https://sgmljs.net/docs/html5.html

4 comments

tannhaeuser

dmsnell 5 months ago

Took me a while to process your response; thanks for providing to so much to think about. And yes, it is indeed nice meeting other SGML enthusiasts!

Can’t say I’m familiar with the use of notations the way you are describing, and it doesn’t help that ISO-8879-1986 is devoid of any reference to the italicized or quoted terms you’re using. Mind sharing where in the spec you see these data attributes?

The SGML spec is hands-down the most complicated spec I’ve ever worked with, feeling like it’s akin to how older math books were written before good symbolic notation was developed — in descriptive paragraphs. The entire lack of any examples in the “Link type definition” section is one way it’s been hard to understand the pieces.

> some HyTime concepts

I’m familiar with sgmljs but have you incorporated HyTime into it? That is, is sgmljs more like SGML+other things? I thought it already was in the way it handles UTF-8.

> Why do you think so and why should this be required by tag inference specifically?

Perhaps it’d be clearer to add chunkable to my description of streaming. Either the parser gets the document from the start and stores all of the entities for processing each successive chunk, or it will mis-parse. The same is true not only for entities for also for other forms of MINIMIZATION and structure.

Consider encountering an opening tag. Without knowing that structure and where it was previously, it’s not clear if that tag needs to close open elements. So I’m contrasting this to things that are self-synchronizing or which allow processing chunks in isolation. As with many things, this is an evaluation on the relative complexity of streaming SGML — obviously it’s easier than HTML because HTML nodes at the top of a document can appear at the end of the HTML text; and obviously harder than UTF-8 or JSON-LD where it’s possible to find the next boundary without context and start parsing again.

> Regarding namespaces, one of their creators (SGML demi-good James Clark himself) considers those a failure:

Definitely aware of the criticism, but I also find them quite useful and that they solve concrete problems I have and wish SGML supported. Even when using colons things can get out of hand to the point it’s not convenient to write by hand. This is where default namespacing can be such an aid, because I can switch into a different content domain and continue typing with relatively little syntax.

Of note, in the linked reference, James Clark isn’t arguing against the use of namespaces, but more of the contradiction between what’s written in the XML and the URL the namespace points to, also of the over-reliance on namespacing for meaning in a document.

What feels most valuable is being able to distinguish a flight of multiple segments and a flight of different beers, and to do so without typing this:

  <com.dmsnell.things.liquids.beer.flight>
    <com.dmsnell.things.liquids.beer.beer>

  <com.dmsnell.plans.travel.itineraries.flight>
    <com.dmsnell.plans.travel.flights.segment>

Put another way, I’m not concerned about the problem of universal inter-linking of my documents and their semantics across the web, but neither was SGML.

Cheers!

tannhaeuser 5 months ago
Get a copy of The SGML Handbook! It includes the ISO 8879 text with much needed commentary. The author/editor deliberately negotiated this to get some compensation for spending over ten years on SGML; it's meant to be read instead of the bare ISO 8879 text.
Regarding HyTime, the editor (Steven Newcomb or was it Elliot Kimber?) even apologized for its extremely hard-to-read text, but the gist of it is that notations were used as a general-purpose extension mechanism, with a couple of auxiliary conventions introduced to make attribute capture and mapping in link rules more useful. sgmljs templating uses these to capture attributes in source docs and supply the captured values as system-specific entities to templates (the point being that no new syntax is introduced that isn't already recognized and specced out in HyTime general facilities). The concept and concrete notations of Formal System Identifiers is also a HyTime general facility.
- dmsnell 5 months ago
  
  HyTime was Kimber’s work, and I found this reflections of his on “worse is better” to be quite refreshing, especially when contrasting the formality of SGML against the flexibility of HTML.
  https://drmacros-xml-rants.blogspot.com/2010/08/worse-is-bet...
  I’ll have to spend more time in the SGML Handbook. It’s available for borrowing at https://archive.org/details/sgmlhandbook0000gold. So far, I’ve been focused on the spec and trying to understand the intent behind the rules.
  > a couple of auxiliary conventions introduced to make attribute capture and mapping in link rules more useful
  Do you have any pointers on how to find these in the spec or in the handbook? Some keywords? What I’ve gathered is that notations point to some external non-SGML processor to handle the rendering and interpretation of content, such as images and image-viewing programs.
  Cheers!
  
  1 reply →