
Comment by vintermann

5 hours ago

UTF-8 is as good a design as could be expected, but Unicode has scope creep issues. What should be in Unicode?

Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because

* It's not discrete. Some code points are for combining with other code points.

* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs that (almost?) always display the same have different code points and meanings.

* It's not all printable. Control characters are in there - they pretty much had to be, due to compatibility with ASCII, but Unicode has added plenty of its own.
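A minimal sketch of the three points above, using Python's standard `unicodedata` module. The specific characters are just illustrative picks of mine, not anything from the article.

```python
import unicodedata

# Not discrete: U+0301 COMBINING ACUTE ACCENT attaches to the preceding letter.
decomposed = "e\u0301"   # two code points
composed = "\u00e9"      # one code point, normally rendered identically
combining_class = unicodedata.combining("\u0301")  # non-zero for combining marks

# Not distinct (one glyph, several spellings): the raw strings differ,
# but they compare equal after NFC normalization.
equal_raw = decomposed == composed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Not distinct (look-alikes): Latin A and Greek Alpha are separate code points.
latin_a = unicodedata.name("A")
greek_alpha = unicodedata.name("\u0391")

# Not all printable: "Cc" is the ASCII-style control category,
# "Cf" covers format controls that Unicode added itself.
bel_category = unicodedata.category("\u0007")   # BEL, inherited from ASCII
zwj_category = unicodedata.category("\u200d")   # ZERO WIDTH JOINER

print(equal_raw, equal_nfc, latin_a, greek_alpha, bel_category, zwj_category)
```

The practical upshot is that comparing user-visible text usually needs a normalization step first, and even that won't merge cross-script look-alikes.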

I'm not aware of any Unicode code points that are animated. At least whatever is printable is printable on paper and not just on screen; there are no marquee or blink control characters, thank God. But who knows when that invariant will fall too.

By the way, I know of one UTF encoding the author didn't mention: UTF-7. It's like UTF-8, but assumes that the high (eighth) bit isn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in UTF-7 once, that's how I know what it is. I don't know how he managed to send it, though.

UTF-7 is/was mostly for email, which is not an 8-bit clean transport. It is obsolete and can't encode the supplementary planes directly (only via surrogate pairs, which were meant for UTF-16).
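For the curious, Python still ships a `utf-7` codec, so the 7-bit property and the surrogate-pair detail can be poked at directly. This is only a sketch; the sample string is an arbitrary choice.

```python
# U+20AC (euro sign) is in the BMP; U+1F600 is in a supplementary plane.
text = "\u20ac and \U0001F600"

utf7_bytes = text.encode("utf-7")

# Every byte stays below 0x80, so the message survives a 7-bit transport.
is_seven_bit = all(b < 0x80 for b in utf7_bytes)
round_trips = utf7_bytes.decode("utf-7") == text

# UTF-7 base64-encodes UTF-16 code units, so U+1F600 travels as the same
# surrogate pair that UTF-16 would use for it.
utf16_units = "\U0001F600".encode("utf-16-be")
surrogate_pair = (
    int.from_bytes(utf16_units[:2], "big"),
    int.from_bytes(utf16_units[2:], "big"),
)

print(utf7_bytes, is_seven_bit, round_trips, [hex(u) for u in surrogate_pair])
```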

There is also UTF-9, from an April Fools' RFC, meant for use on hosts with 36-bit words such as the PDP-10.

  • I meant to specify: the aim of UTF-7 is better served by using UTF-8 with `Content-Transfer-Encoding: quoted-printable`.
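A rough illustration of that alternative, using Python's standard `quopri` module on UTF-8 bytes. The sample text is made up, and a real mail library would also set the headers for you.

```python
import quopri

text = "na\u00efve caf\u00e9"          # contains two non-ASCII letters
utf8_bytes = text.encode("utf-8")      # 8-bit data, not safe for 7-bit mail

# Quoted-printable escapes each non-ASCII byte as =XX and leaves ASCII readable.
qp_bytes = quopri.encodestring(utf8_bytes)

is_seven_bit = all(b < 0x80 for b in qp_bytes)
round_trips = quopri.decodestring(qp_bytes).decode("utf-8") == text

print(qp_bytes, is_seven_bit, round_trips)
```

Unlike UTF-7, the ASCII parts stay legible on the wire and the payload is ordinary UTF-8 once decoded.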