Comment by vintermann
7 hours ago
UTF-8 is as good a design as could be expected, but Unicode has scope-creep issues. What should be in Unicode?
Coming at it naively, people might think the scope is something like "all sufficiently widespread, distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because:
* It's not discrete. Some code points are for combining with other code points.
* It's not distinct. Some glyphs can be written in multiple ways, and some glyphs that (almost?) always display the same have different code points and meanings (a quick demo follows this list).
* It's not all printable. Control characters are in there - they pretty much had to be, for compatibility with ASCII, but Unicode has added plenty of its own.
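A quick demo of the "not discrete, not distinct" points, using nothing but Python's standard library:

```python
# Two ways to write "é": one precomposed code point, or a base letter plus a
# combining mark. They render identically but compare unequal until normalized.
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT

print(precomposed, combining)    # é é  (same glyph on screen)
print(precomposed == combining)  # False: different code point sequences
print(unicodedata.normalize("NFC", combining) == precomposed)  # True after NFC normalization
```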
I'm not aware of any Unicode code points that are animated - at least, what's printable is printable on paper and not just on screen; there are no marquee or blink control characters, thank God. But who knows when that invariant will fall too.
By the way, I know of one UTF encoding the author didn't mention: UTF-7. Like UTF-8, but assuming that the eighth (high) bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in UTF-7 once; that's how I know what it is. I don't know how he managed to send it, though.
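For the curious, UTF-7 is easy to poke at from Python, which still ships a codec for the RFC 2152 scheme (a small sketch):

```python
# UTF-7 keeps every byte in the 7-bit ASCII range; non-ASCII characters are
# shifted into base64-encoded UTF-16 between '+' and '-'.
text = "héllo"
wire = text.encode("utf-7")
print(wire)                          # b'h+AOk-llo'  ('é' is UTF-16 0x00E9, base64 "AOk")
print(wire.decode("utf-7"))          # héllo
print(all(b < 0x80 for b in wire))   # True: the high bit is never used
```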
Indeed, one pain point of Unicode is CJK unification. https://heistak.github.io/your-code-displays-japanese-wrong/
UTF-7 is/was mostly for email, which is not an 8-bit-clean transport. It is obsolete and can't natively encode the supplementary planes (except via surrogate pairs, which were meant for UTF-16).
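The surrogate-pair fallback is visible with the same Python codec (a small sketch, nothing beyond the standard library assumed):

```python
# Outside the BMP, UTF-7 has no native form; the codec base64-encodes the
# UTF-16 surrogate pair instead.
g_clef = "\U0001D11E"                  # MUSICAL SYMBOL G CLEF (supplementary plane)
wire = g_clef.encode("utf-7")
print(wire)                            # b'+2DTdHg-': base64 of surrogates D834 DD1E
print(wire.decode("utf-7") == g_clef)  # True: it still round-trips
```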
There is also UTF-9, from an April Fools RFC, meant for use on hosts with 36-bit words such as the PDP-10.
I meant to specify: the aim of UTF-7 is better served by using UTF-8 with `Content-Transfer-Encoding: quoted-printable`.
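For illustration, that combination is a few lines with Python's standard quopri module (a minimal sketch of the idea, not a full MIME message):

```python
# UTF-8 on the inside, 7-bit-safe quoted-printable on the wire.
import quopri

utf8_bytes = "héllo".encode("utf-8")              # b'h\xc3\xa9llo'
wire = quopri.encodestring(utf8_bytes)
print(wire)                                       # b'h=C3=A9llo' (pure ASCII)
print(quopri.decodestring(wire).decode("utf-8"))  # héllo
```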
Unicode wanted the ability to losslessly round-trip every other encoding, in order to be easy to adopt partially in a world where other encodings were still in use. It merged a bunch of different incomplete encodings that used competing approaches. That's why there are multiple ways of encoding the same characters, and there's no overall consistency to it. It's hard to say whether that was a mistake. This level of interoperability may have been necessary for Unicode to actually win, rather than becoming another episode of https://xkcd.com/927
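One visible consequence: compatibility characters inherited from legacy CJK encodings, such as the fullwidth Latin letters. A small sketch of how NFKC folds one of those duplicates (standard library only):

```python
# U+FF21 exists so round-trips with legacy East Asian encodings stay lossless;
# NFKC normalization maps it back to plain 'A'.
import unicodedata

fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
print(fullwidth_a == "A")                                 # False: distinct code points
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")  # True: the compatibility duplicate folds away
```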