Comment by jasode

1 day ago

>- XHTML. [...] Would it kill people to have to close their tags properly?

XHTML appeals to the intuition that there should be a Strict Right Way To Do Things ... but you can't use that unforgiving framework for web documents that are widely shared.

The "real world" has 2 types of file formats:

(1) file types where consumers cannot contact/control/punish the authors (open-loop): HTML, PDF, zip, csv, etc. The common theme is that the data itself is more important than the file format. That's why Adobe Reader will read malformed PDF files written by buggy PDF libraries. And both 7-Zip and WinRAR can read malformed zip files with broken headers (because some old buggy Java libraries wrote bad zip files). MS Excel can import malformed csv files. E.g. the Citi bank export to csv wrote a malformed file and it was desirable that MS Excel imported it anyway because the raw data of dollar amounts was more important than the incorrect commas in the csv file -- and -- I have no way of contacting the programmer at Citi to tell them to fix their buggy code that created the bad csv file.

(2) file types where the consumer can control the author (closed-loop): programming language source code like .c, .java, etc or business interchange documents like EDI. There's no need to have a "lenient forgiving" gcc/clang compiler to parse ".c" source code because the "consumer-and-author" will be the same person. I.e. the developer sees the compiler stop at a syntax error so they edit and fix it and try to re-compile. For business interchange formats like EDI, a company like Walmart can tell the vendor to fix their broken EDI files.
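To make the open-loop vs closed-loop split concrete, here is a minimal sketch using Python's standard `csv` module, which can run in either mode. The sample export is made up for illustration (it is not the actual Citi file): a stray word follows a closing quote. The default reader salvages the dollar amount, while `strict=True` rejects the file the way a compiler rejects a syntax error.

```python
import csv
import io

# Hypothetical bank export: stray characters follow a closing quote.
BAD_EXPORT = 'amount,memo\n100.00,"rent"paid\n'


def try_read(text, strict):
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        rows = list(csv.reader(io.StringIO(text), strict=strict))
        return rows, None
    except csv.Error as exc:
        return None, str(exc)


# Open-loop consumer: take whatever data can be recovered.
lenient_rows, lenient_error = try_read(BAD_EXPORT, strict=False)

# Closed-loop consumer: refuse the file and send it back to the author.
strict_rows, strict_error = try_read(BAD_EXPORT, strict=True)

print("lenient:", lenient_rows)
print("strict: ", strict_error)
```

The lenient path only makes sense when nobody can be asked to fix the exporter; the strict path only makes sense when somebody can.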

XHTML wants to be in group (2) but web surfers can't control all the authors of .html so that's why lenient parsing of HTML "wins". XHTML would work better in a "closed-loop" environment such as a company writing internal documentation for its employees. E.g. an employee handbook can be written in strict XHTML because both the consumers and authors work at the same company. E.g. can't see the vacation policy because the XHTML syntax is wrong?!? Get on the Slack channel and tell the programmer or content author to fix it.
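As a rough illustration of the two behaviours, the sketch below feeds the same page to a strict XML parser and to a lenient HTML tokenizer, both from the Python standard library. The handbook page and its unclosed `<p>` tag are invented for the example, and `html.parser` is only a stand-in for a browser's far more elaborate error recovery.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical handbook page with an unclosed <p> tag.
PAGE = "<html><body><p>Vacation policy: 20 days</body></html>"


class TextCollector(HTMLParser):
    """Lenient consumer: collects text and never checks tag balance."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_strict(markup):
    """XHTML-style parsing: any well-formedness error is fatal."""
    try:
        ET.fromstring(markup)
        return True, None
    except ET.ParseError as exc:
        return False, str(exc)


def parse_lenient(markup):
    """HTML-style parsing: extract the text regardless of tag errors."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


strict_ok, strict_error = parse_strict(PAGE)
lenient_text = parse_lenient(PAGE)

print("strict: ", strict_ok, strict_error)
print("lenient:", lenient_text)
```

The strict reader shows nothing until someone fixes the page; the lenient reader still shows the vacation policy.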

The problem is that group (1) results in a nightmarish race-to-the-bottom. File creators have zero incentive to create spec-compliant files, because there's no penalty for creating corrupted files. In practice this means a large proportion of documents are going to end up corrupt. Does it open in Chrome? Great, ship it! The file format is no longer the specification, but it has now become a wild guess at whatever weird garbage the incumbent is still willing to accept. This makes it virtually impossible to write a new parser, because the file format suddenly has no specification.

On the other hand, imagine a world where Chrome would slowly start to phase out its quirks modes. Something like a yellow address bar and a "Chrome cannot guarantee the safety of your data on this website, as the website is malformed" warning message. Turn it into a red bar and a "click to continue" after 10 years, remove it altogether after 20 years. Suddenly it's no longer that one weird customer who is complaining, but everyone - including your manager. Your mistakes are painfully obvious during development, so you have a pretty good incentive to properly follow the spec. You make a mistake on a prominent page and the CTO sees it? Well, guess you'll be adding an XHTML validator to your CI pipeline next week!

It is very tempting to write a lenient parser when you are just one small fish in a big ecosystem, but over time it will inevitably lead to the degradation of that very ecosystem. You need some kind of standards body to publish a validating reference parser. And like it or not, Chrome is big enough that it can act as one for HTML.

  • >File creators have zero incentive to create spec-compliant files, because there's no penalty for creating corrupted files

    This depends. If you are a small creator with a unique corruption then you're likely out of luck. The problem is the big creators, whose attitude is "fuck you, I do what I want."

    >"Chrome cannot guarantee the safety of your data on this website, as the website is malformed" warning message.

    This would appear on pretty much every website. And it would appear on websites that are no longer updated, and they'd functionally disappear from any updated browser. In addition, the 10-20 year thing just won't work in US companies. Simply put, if they get too much pressure on it next quarter, it's gone.

    >Your mistakes are painfully obvious during development,

    Except this isn't how a huge number of websites work. They get HTML from many sources and possibly libraries. Simply put, no one is going to follow your insanity, which is why XHTML never worked in the first place. They'll drop Chrome before they drop the massive amount of existing and potential bugs out there.

    >And like it or not, Chrome is big enough that it can act as one for HTML.

    And hopefully in a few years between the EU and US someone will bust parts of them up.

    • We don't accept this from any other file format - why is HTML different? For example, if I include random blocks of data in a JPEG file, the picture is all broken or the parser gives up (which is often turned into a partial picture by some abstraction layer that ignores the error code) - in both cases the end user treats it as completely broken. If I add random bytes into a Word or LibreOffice document I expect it not to load at all.

  • That would break decades of the web with no incentive for Google to do so. Plus, any change of that scale that they make is going to draw antitrust consideration from _somebody_.

  • You’re right, but even standards bodies aren’t enough. At the end of the day, it’s always about what the dominant market leader will accept. The standard just gives your bitching about the corrupted files some abstract moral authority, but that’s about it.

I’d argue a good comparison here is HTTPS. Everyone decided it would be good for sites to move over to serving via HTTPS so browsers incentivised people to move by gating newer features to HTTPS only. They could have easily done the same with XHTML had they wanted.

  • The opportunities to fix this were pretty abundant. For instance, it would take exactly five words from Google to magically make a vast proportion of web pages valid XHTML:

    > We rank valid XHTML higher

    It doesn’t even have to be true!

> That's why Adobe Reader will read malformed pdf files written by buggy PDF libraries.

No, the reason is that Adobe’s implementation never bothered to perform much validation, and then couldn’t add strict validation retroactively because it would break too many existing documents.

And it’s really the same for HTML.

This is an argument for a repair function that transforms a broken document into a well-formed one without loss but keeps the spec small, simple and consistent. It's not an argument for baking malformations into a complex messy spec.
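A toy version of such a repair step might look like the sketch below, assuming Python's standard `html.parser` as the tokenizer. The `Repairer` class, the `repair` function and the short `VOID` list are all hypothetical names for this example. It is also not lossless: comments, doctypes and stray closing tags are dropped, so it only illustrates the shape of the idea, with the strict parser kept separate from the repair logic.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# Elements with no closing tag in HTML; emitted in self-closing form.
# This list is deliberately incomplete.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Repairer(HTMLParser):
    """Re-emits the token stream as balanced, well-formed markup."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        rendered = ""
        for name, value in attrs:
            # Valueless attributes come through as None; emit them as "".
            rendered += ' {}="{}"'.format(name, escape(value or "", quote=True))
        if tag in VOID:
            self.out.append("<{}{}/>".format(tag, rendered))
        else:
            self.out.append("<{}{}>".format(tag, rendered))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: dropped (this is where loss happens)
        # Close everything opened after the matching start tag, then the tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</{}>".format(top))
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def repair(markup):
    """Return a well-formed version of possibly broken markup."""
    repairer = Repairer()
    repairer.feed(markup)
    repairer.close()
    while repairer.open_tags:  # close whatever is still open at end of input
        repairer.out.append("</{}>".format(repairer.open_tags.pop()))
    return "".join(repairer.out)


repaired = repair("<div><p>Hello <b>world</div>")
ET.fromstring(repaired)  # the strict parser now accepts it
print(repaired)
```

The design point is that the strict parser and the spec stay small, and all the guessing lives in one separate, replaceable function.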

We could've made the same arguments for supporting Adobe Flash on the iPhone.

And yet Apple decided that no, this time we do it the "right" way[1], stuck with plain HTML/CSS/JS and frankly we're all better for it.

[1] I'm aware this is a massive oversimplification and there were more cynical reasons behind dropping the flash runtime from iOS, but they're not strictly relevant to this discussion.