Comment by chowells

1 year ago

Forbidding \r\n line endings in the encoding just sort of sinks the whole idea. The first couple ideas are nice, but then you suddenly get normative with what characters are allowed to be encoded? That creates a very large initial hurdle to clear to get people to use your encoding. Suddenly you need to forbid specific texts, instead of just handling everything. Why put such a huge footgun in your system when it's not necessary?

Yeah it doesn’t make much sense. In addition to being the default line ending on Windows, \r\n is part of the syntax of many text-based protocols (e.g. SMTP and IMAP) that support UTF-8, so clients/servers of all these protocols would be broken.

Many things make sense to me, but as we can all guess, this will never become a thing :(

But the "magic number" thing to me is a waste of space. If this standard is accepted, if no magic number you have corrected UTF-8.

As for \r\n, not a big deal to me. I would like to see it forbidden, if only to force Microsoft to use \n like UN*X and Apple. I still need to deal with \r\n in files showing up every so often.

  • "If this standard is accepted, if no magic number you have corrected UTF-8."

    That's true only if "corrected UTF-8" is accepted and existing UTF-8 becomes obsolete. That can't happen. There's too much existing UTF-8 text that will never be translated to a newer standard.

  • You do realize that it's the UNIX people who are the strange ones here? CRLF has been used as the line delimiter by everyone (except IBM, who always lived in their own special EBCDIC land) since the late sixties, but then Thompson decided that he'd rather do LF-to-CRLF translation in the kernel tty driver than store the text on disk as-is, like literally every other OS did (and continued to do).

    Besides, terminal emulators nowadays speak UTF-8 natively, and they absolutely do behave differently for a naked LF and for CRLF. You can see it for yourself if you run "stty -onlcr" and then try to echo or cat some stuff. Sure, you can try to persuade every single terminal emulator's author to adopt "automatic carriage return", but most will refuse. You will also need to somehow persuade people to stop emitting the CR+LF combination in raw mode... but then you'll need to give them back the old LF functionality (go down one line, scroll if necessary) somehow. Now, such functionality exists as the IND character, which is in the now forbidden C1 block. Simply amazing!

    • I gotta side with Thompson on this one.

      There's no point in a carriage return without a newline. So why have both just because of the 1933 teletype's hardware implementation? It's purely a hardware thing. That's why Multics used \n, and that's likely why Thompson chose to continue that practice.

      When ASCII came about, it wasn't really about text files. Computers didn't talk to each other back then. ASCII was about sending characters between devices, and for compatibility reasons a lot of devices copied \r\n from the teletype. But there were a lot of devices that didn't as well. Putting it in the driver makes perfect sense from the point of view of someone developing a system in the 1960s.


    • > "like literally every other OS did (and continued to do)."

      If I remember correctly, Macs used a bare carriage return as the line delimiter.

      So the trick when you got a text document was to figure out where it came from.

      Windows = CRLF, Mac = CR, Unix = LF

      I suspect that nowadays (I don't have a Mac, so this is a guess) Macs default to linefeeds, because they are more or less a Unix system.
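
      For what it's worth, the "figure out where it came from" trick can be sketched in a few lines of Python. This is only a rough heuristic, not anything from a standard: the function names and the origin labels are made up for illustration, and it simply counts each kind of line ending and reports the most common one.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare CRs, and bare LFs in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the bare occurrences.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "cr": cr, "lf": lf}


def guess_origin(data: bytes) -> str:
    """Guess the originating platform from the dominant line ending.

    The labels are illustrative; mixed or tiny files can easily fool this.
    """
    counts = count_line_endings(data)
    if not any(counts.values()):
        return "unknown"  # no line endings at all
    labels = {"crlf": "windows", "cr": "classic mac", "lf": "unix"}
    return labels[max(counts, key=counts.get)]
```

      Calling guess_origin(b"a\r\nb\r\n") gives "windows", and a file with no line endings at all gives "unknown". Files with mixed endings need more care than this.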


  • Magic numbers do appear a lot in C# programs. The default text encoder will output a BOM marker.
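
    If you end up on the receiving side of such files, here is a minimal Python sketch of checking for and stripping the UTF-8 BOM (the bytes EF BB BF). The helper names are my own; codecs.BOM_UTF8 and the "utf-8-sig" codec are from the Python standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM if one is present.

    The "utf-8-sig" codec accepts input both with and without the BOM.
    """
    return data.decode("utf-8-sig")
```

    With this, decode_utf8(b"\xef\xbb\xbfhi") and decode_utf8(b"hi") both return "hi", so the reader does not need to know whether the writer emitted the marker.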