Comment by jasode

10 hours ago

>Why is this a problem for HTML - and only HTML?

Your premise is not correct because you're not aware that other data formats also have parsers that accept malformed content. Examples:

- pdf files: many files with errors can be read by Adobe Acrobat. PDF libraries for developers often replicate this behavior so that they can open the same invalid pdf files.

- zip files. 7-Zip and WinRAR can open some malformed zip files that don't follow the official PKZIP specification. E.g. 7-Zip has extra defensive code that looks for a bad 2-byte sequence that shouldn't be there and skips over it.

- csv files. MS Excel can read some malformed csv files.

- SMTP email headers: Mozilla Thunderbird, MS Outlook, etc. can parse fields that don't exactly comply with RFC 822 -- making some guesses along the way -- and then successfully display the email content to the user.

The common theme in the above, including HTML: the raw content is more important than a perfectly standards-compliant file format. That's why parsers across various domains make a best effort to load the file even when it's not 100% free of syntax errors.

>Meanwhile, in any other formal language (including JS and CSS!), the standard assumption is that syntax errors are fatal,

Invalid CSS is not a fatal parsing error. Example of validating the HTML/CSS of a job listings webpage at Monster.com: https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.monster.c...

It has CSS errors such as:

  Error: CSS: background-color: none is not a background-color value.  From line 276, column 212; to line 276, column 215
  Error: CSS: padding: 8x is not a padding value.

Job hunters in the real world want to see the jobs because the goal is to get a paycheck. Therefore, a web browser that refused to show the webpage just because the author mistakenly wrote CSS "none" instead of "transparent" and "8x" instead of "8px" would be user-hostile software.
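To make the recovery behavior concrete: CSS error handling says a declaration with an invalid value is ignored and parsing carries on with the next one. Here is a toy sketch of that idea in Python. It is not how any real browser is implemented; the names `is_valid` and `parse_declarations` are made up for illustration, and the validators only know two properties and a handful of values.

```python
import re

# Toy validators covering two properties only; real CSS grammar is far larger.
LENGTH = re.compile(r"^(0|\d+(\.\d+)?(px|em|rem|%))$")
HEX_COLOR = re.compile(r"^#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$")
NAMED_COLORS = {"transparent", "white", "black", "red"}  # tiny illustrative subset


def is_valid(prop, value):
    """Return True if this toy parser recognises the property/value pair."""
    if prop == "padding":
        parts = value.split()
        return bool(parts) and all(LENGTH.match(p) for p in parts)
    if prop == "background-color":
        return value in NAMED_COLORS or HEX_COLOR.match(value) is not None
    return False  # properties this sketch does not know are dropped too


def parse_declarations(block):
    """Keep valid declarations, collect invalid ones, never abort."""
    kept, dropped = {}, []
    for decl in block.split(";"):
        decl = decl.strip()
        if not decl:
            continue
        prop, sep, value = decl.partition(":")
        prop, value = prop.strip().lower(), value.strip()
        if sep and is_valid(prop, value):
            kept[prop] = value  # a later valid declaration overrides an earlier one
        else:
            dropped.append(decl)  # skip the bad declaration and keep going
    return kept, dropped


kept, dropped = parse_declarations(
    "background-color: none; padding: 8x; padding: 8px; background-color: transparent"
)
```

With the two mistakes from the validator report as input, `dropped` holds `background-color: none` and `padding: 8x`, while the valid declarations still end up in `kept`. The page renders; only the broken rules are lost.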

> csv files. MS Excel can read some malformed csv files.

At work we have to parse CSV files which often have mixed encoding (Latin-1 with UTF-8 in random fields on random rows), occasionally have partial lines (remainder of line just missing) and other interesting errors.
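One possible way to handle the mixed Latin-1/UTF-8 case, as a minimal sketch under assumptions (not necessarily what the setup described above does): decode the whole file as Latin-1, which cannot fail because every byte maps to a code point, let the `csv` module split the fields, then re-decode each field as UTF-8 where that works. The helper names `decode_field` and `read_mixed_csv` and the `expected_fields` parameter are hypothetical. Rows with the wrong field count, such as truncated lines, are set aside rather than aborting the load.

```python
import csv
import io


def decode_field(raw: bytes) -> str:
    """Prefer UTF-8; fall back to Latin-1, which accepts any byte sequence."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def read_mixed_csv(data: bytes, expected_fields: int):
    """Split rows first, fix encoding per field, quarantine short or long rows."""
    # Latin-1 is a lossless byte-to-code-point mapping, so the delimiters
    # (comma, quote, newline) survive and the original bytes can be recovered.
    text = data.decode("latin-1")
    good, bad = [], []
    reader = csv.reader(io.StringIO(text, newline=""))
    for rownum, row in enumerate(reader, start=1):
        fields = [decode_field(f.encode("latin-1")) for f in row]
        if len(fields) == expected_fields:
            good.append(fields)
        else:
            bad.append((rownum, fields))  # e.g. a line whose remainder is missing
    return good, bad


sample = (
    b"id,name\n"
    + b"1," + "Jos\u00e9".encode("utf-8") + b"\n"    # UTF-8 field
    + b"2," + "Zo\u00eb".encode("latin-1") + b"\n"   # Latin-1 field
    + b"3\n"                                         # partial line
)
good, bad = read_mixed_csv(sample, expected_fields=2)
```

The known weakness of this heuristic is a Latin-1 field whose bytes happen to form valid UTF-8; it will be misread. That is rare in practice for ordinary text but worth keeping in mind.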

We also have to parse fixed-width flat files where fields occasionally aren't fixed-width after all, with no discernible pattern. Customer can't fix the broken proprietary system that spits this out so we have to deal with it.

And of course, XML files with an encoding mismatch (because that header is just a fixed string that has no bearing on the rest of the content, right?) or even mixed encoding. That's just par for the course.
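For the plain encoding-mismatch case, a rough sketch of one fallback strategy using the standard library: try the parse with the declared encoding first, and if that fails, strip the XML declaration and retry with a list of candidate encodings. The function name `parse_xml_lenient` and the candidate list are assumptions for illustration. This does not address files that mix encodings within the same document.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(rb"^\s*<\?xml[^>]*\?>")


def parse_xml_lenient(data: bytes, fallbacks=("utf-8", "cp1252", "latin-1")):
    """Return (root, encoding_used); 'declared' means the header was honest."""
    try:
        return ET.fromstring(data), "declared"
    except ET.ParseError:
        pass  # the declaration may not match the actual bytes; guess instead

    # Drop the untrustworthy declaration, decode ourselves, and parse the text.
    body = XML_DECL.sub(b"", data, count=1)
    for enc in fallbacks:
        try:
            return ET.fromstring(body.decode(enc)), enc
        except (UnicodeDecodeError, ET.ParseError):
            continue
    raise ValueError("could not parse XML with any candidate encoding")


# Header claims UTF-8, but the payload byte 0xEB is Latin-1/cp1252 for "ë".
doc = b'<?xml version="1.0" encoding="UTF-8"?><r><n>Zo\xeb</n></r>'
root, used = parse_xml_lenient(doc)
```

Here the strict parse fails, the UTF-8 retry fails on the stray byte, and the cp1252 retry succeeds, so `used` reports which guess was needed. Logging that value is a cheap way to find out which upstream systems are lying in their headers.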

Just some examples of how fun parsing can be.