Comment by pwdisswordfishy
16 hours ago
> I never understood why the more strict rules of XML for HTML never took off
Internet Explorer failing to support XHTML at all (which also forced everyone to serve XHTML with the HTML media type and to avoid incompatible syntax like self-closing <script />); Firefox at first failing to support progressive rendering of XHTML; a dearth of tooling for emitting well-formed XHTML (remember, those were the days of PHP building markup by string concatenation) and the resulting fear of pages failing to render at all (the so-called Yellow Screen of Death); and a side helping of the WHATWG cartel^W organization declaring XHTML "obsolete". It probably didn't help that XHTML did not offer any new features over tag-soup HTML syntax.
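(To illustrate the <script /> point with a minimal sketch; the file name is invented. When such a page is served as text/html, the parser ignores the trailing slash, so the script element never closes:)

    <!-- Served as text/html, the trailing "/" is ignored and
         the <script> element stays open: -->
    <script src="app.js" />
    <p>Everything from here on is consumed as script text, not
       markup, until an explicit closing script tag shows up.</p>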
I think most of those are no longer relevant, so I still kind of hope that XHTML could have a resurgence and that the tag-soup syntax could finally be discarded. It's long overdue.
What I never understood was why, for HTML specifically, syntax errors are such a fundamental unsolvable problem that it's essential that browsers accept bad content.
Meanwhile, in any other formal language (including JS and CSS!), the standard assumption is that syntax errors are fatal, that the responsibility for fixing them lies with the page author, and that fixing them is not a difficult problem.
Why is this a problem for HTML - and only HTML?
HTML is a markup language for formatting text, not a programming or data-serialization language, so end users have always preferred to see imperfectly coded or incompletely loaded web pages rendered imperfectly rather than receive a failure message, particularly on 90s dial-up. The same applies to most other markup languages.
The web owes its success to having low barriers to entry. It very quickly became a mixture of pages hand-coded by people who weren't programmers, content produced by CMSes that included things the author didn't directly control and that weren't necessarily reliable at putting tags in the right place, and third-party widgets activated by pasting in whatever code the third party had given you. And browsers became really good at attempting to render erroneous and ambiguous markup (and, for that matter, were usually out of date or plain bad at rigidly implementing standards).
There was a movement to serve XHTML as XML via the application/xhtml+xml MIME type, but it never took off because browsers didn't do anything with it except load a user-hostile error page if a closing tag was missed (or refuse to load it at all, in the case of IE6 and older browsers). And if you wanted to do clever transformations of your source data, there were ways to achieve that other than formatting the markup sent to the browser as a subset of XML.
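(Roughly how the switch worked, as a sketch: the same markup got the forgiving parser or the draconian XML parser purely based on the Content-Type response header.)

    # Forgiving tag-soup parser; errors are silently recovered:
    Content-Type: text/html

    # Strict XML parser; a single well-formedness error
    # replaces the page with the error screen:
    Content-Type: application/xhtml+xml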
>Why is this a problem for HTML - and only HTML?
Your premise is not correct: plenty of other data formats also have parsers that accept malformed content. Examples:
- pdf files: many files with errors can be read by Adobe Acrobat. And PDF libraries for developers often replicate this behavior so they too can open the same invalid pdf files.
- zip files. 7-Zip and WinRAR can open some malformed zip files that don't follow the official PKZIP specification. E.g. 7-Zip has extra defensive code that looks for a bad 2-byte sequence that shouldn't be there and skips over it.
- csv files. MS Excel can read some malformed csv files.
- SMTP email headers: Mozilla Thunderbird, MS Outlook, etc. can parse fields that don't exactly comply with RFC 822 -- make some guesses -- and then successfully display the email content to the user.
The common theme in all of the above, HTML included, is that the raw content is more important than a perfectly standards-compliant file format. That's why parsers across these domains make a best effort to load the file even when it isn't 100% free of syntax errors.
>Meanwhile, in any other formal language (including JS and CSS!), the standard assumption is that syntax errors are fatal,
Parsing invalid CSS is not a fatal error. Here's an example of validating the HTML/CSS of a job-listings webpage at Monster.com: https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.monster.c...
It has CSS errors such as "none" written where "transparent" was intended and "8x" instead of "8px".
Job hunters in the real world want to see the jobs, because the goal is to get a paycheck. Therefore, a web browser that refused to show the webpage just because of those two mistakes would be user-hostile software.
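A minimal sketch of how a browser recovers (the selector and the valid declarations here are invented, not taken from the Monster.com stylesheet):

    .job-card {
      background-color: none;  /* invalid value: this declaration is dropped */
      border-radius: 8x;       /* invalid unit: dropped too */
      border-radius: 8px;      /* valid: applied */
      color: #333;             /* valid: applied; the rule as a whole survives */
    }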
> csv files. MS Excel can read some malformed csv files.
At work we have to parse CSV files which often have mixed encoding (Latin-1 with UTF-8 in random fields on random rows), occasionally have partial lines (the remainder of the line just missing), and other interesting errors.
We also have to parse fixed-width flat files where fields occasionally aren't fixed-width after all, with no discernible pattern. Customer can't fix the broken proprietary system that spits this out so we have to deal with it.
And of course, XML files with an encoding mismatch (because that header is just a fixed string with no bearing on the rest of the content, right?) or even mixed encoding. That's just par for the course.
Just some examples of how fun parsing can be.
It's mostly historical. Browsers accepted invalid HTML for 10 years; there's a lot of content authored with that assumption that's never going to be updated, so now we're stuck with it.
We could be stricter for new content, but why bother if you have to include the legacy parser anyway? And the HTML5 algorithm brings us most of the benefits of a stricter syntax (deterministic parsing) while still allowing the looseness.
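For example (a rough sketch; the recovery steps are spelled out by the HTML5 parsing algorithm, so every conforming browser builds the same tree from the same bad input):

    <!-- Input with missing end tags and misnested elements: -->
    <p>first paragraph
    <p><b>bold <i>bold italic</b> italic?</i>

    <!-- Roughly the tree every HTML5 parser recovers: -->
    <p>first paragraph</p>
    <p><b>bold <i>bold italic</i></b><i> italic?</i></p>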
> never going to be updated, so now we're stuck with it.
Try going to any 1998 web page in a modern browser... it's generally so broken as to be unusable.
On top of every page telling me to install Flash, most links are dead, most scripts don't run properly (VBScript!?), TLS versions are now incompatible, etc.
We shouldn't put much effort into backwards compatibility if it doesn't work in practice. The best bet for opening a 1998 web page is to install IE6 in a VM, where everything works wonderfully.
Syntax errors are not fatal in CSS. CSS has detailed rules for how to handle and recover from syntax errors, usually by skipping the invalid token. This is what allows introducing new syntax in a backwards-compatible manner.
> Meanwhile, in any other formal language (including JS and CSS!), the standard assumption is that syntax errors are fatal,
In CSS, a syntax error isn't fatal. An unrecognized or malformed property declaration causes only that declaration to be ignored; the rest of the rule applies normally. An invalid selector does invalidate its whole selector list, but :is() and :where() support forgiving selector lists [1], where the unknown parts are simply skipped.
[1]: https://drafts.csswg.org/selectors-4/#typedef-forgiving-sele...
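A small sketch of the difference (the class name and the unknown pseudo-class are invented):

    /* Plain selector list: the unknown pseudo-class invalidates the
       whole list, so neither selector gets the outline. */
    .card:hover, .card:unknown-thing {
      outline: 2px solid red;
    }

    /* Forgiving selector list: the unknown part is skipped,
       and .card:hover still gets the outline. */
    :is(.card:hover, .card:unknown-thing) {
      outline: 2px solid red;
    }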
> What I never understood was why, for HTML specifically, syntax errors are such a fundamental unsolvable problem that it's essential that browsers accept bad content.
Because HTML is a content language, and at any given time the main purpose of the major engines is to access a large body of content that is older than the newest revision of the language. Anything that creates significant incompatibilities, or forces complete rewrites of large bodies of work just to incorporate new features, is simply not going to be implemented as specified by the major implementers (it will either not be implemented at all, or will be modified), because it is hostile to what those implementations are used for.
Because HTML is designed to be written by everyone, not just “engineers”, and we’d rather be able to read what they have to say even if they get it wrong.
It's more that it was exceedingly easy to generate bad X(HT)ML strings, especially back when you had PHP concatenating strings as you went. Most HTML on the web is live/dynamic, so there's no developer there to catch syntax errors and "make build" again.
> It probably didn't help that XHTML did not offer any new features over tag-soup HTML syntax.
Well, this is not entirely true: XML namespaces enabled attaching arbitrary data to XHTML elements in a much more elegant, orthogonal way than the half-assed solution HTML5 ended up with (the data-* attribute set), and also enabled embedding other XML applications like XForms, SVG, and MathML (though I'm not sure how widely supported that was at the time; some of it was backported into HTML5 anyway, in a way that later led to CVEs). But this is rather niche.
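Roughly the contrast, as a sketch (the namespace URI and attribute names are made up):

    <!-- XHTML: custom attributes live in their own namespace -->
    <div xmlns:app="http://example.com/ns/app" app:user-id="42">...</div>

    <!-- HTML5: the data-* convention -->
    <div data-user-id="42">...</div>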
I was there, Gandalf. I was there 30 years ago. I was there when the strength of men failed.
Netscape started this. NCSA was in favor of XML-style rules over SGML, but Netscape embraced SGML leniency fully, and several tools of that era generated web pages that only rendered properly in Netscape. So people voted with their feet and went to the panderers. If I had a dollar for every time someone told me, “well, it works in Netscape”, I’d be retired by now.
Emitting correct XHTML was not that hard. The biggest problem was that browsers supported plugins that could corrupt the whole page. If you created an XHTML webpage, you had to handle bug reports caused by poorly written plugins.