
Comment by mort96

11 hours ago

That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8, that made the sacrifice: Unicode is limited to 21 bits because of UTF-16. The UTF-8 design trivially extends to 6-byte sequences covering values up to 31 bits. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?
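To make the "trivially extends" point concrete, here is a rough Python sketch of the original pre-RFC 3629 UTF-8 scheme, which allowed 5- and 6-byte sequences; the function name and the exact table layout are just for illustration:

    def utf8_encode_original(cp: int) -> bytes:
        """Encode using the original (pre-RFC 3629) UTF-8 scheme, which
        allowed sequences of up to 6 bytes and values up to 2**31 - 1."""
        if cp < 0x80:                          # 1 byte carries 7 bits
            return bytes([cp])
        # (highest encodable value, continuation bytes, leading-byte prefix)
        for limit, cont, prefix in [(0x7FF, 1, 0xC0), (0xFFFF, 2, 0xE0),
                                    (0x1FFFFF, 3, 0xF0), (0x3FFFFFF, 4, 0xF8),
                                    (0x7FFFFFFF, 5, 0xFC)]:
            if cp <= limit:
                out = [prefix | (cp >> (6 * cont))]
                for i in range(cont - 1, -1, -1):
                    out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
                return bytes(out)
        raise ValueError("the original UTF-8 design tops out at 31 bits")

    # Modern UTF-8 (RFC 3629) stops at 4 bytes / U+10FFFF purely because
    # Unicode stops there; the 6-byte form reaches the full 31-bit value.
    assert utf8_encode_original(0x20AC) == "€".encode("utf-8")   # sanity check
    print(utf8_encode_original(0x7FFFFFFF).hex())                # fdbfbfbfbfbf

The leading byte's prefix already tells a decoder how long the sequence is, so extending the table to 5 and 6 bytes changes nothing structural about the encoding.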

Is 21 bits really a sacrifice? That's about 2 million code points, and we currently use about a tenth of that.

Even with all Chinese characters de-unified, all the notable historical and constructed scripts, the technical symbols, and every submitted emoji (including the rejections), you are still well short of a million.

We will probably never need more than 21 bits unless we start stretching the definition of what text is.

  • It's not 2 million; it's a little over 1 million.

    The exact number is 1112064 = 2^16 - 2048 + 16*2^16: in UTF-16, two bytes can encode 2^16 - 2048 code points, and four bytes (a surrogate pair) can encode 16*2^16 more. The 2048 surrogates themselves are not counted because they can never appear on their own; they exist purely as UTF-16 encoding machinery. The quick check below spells out the arithmetic.
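    A minimal sketch of that arithmetic in plain Python, assuming nothing beyond the numbers above:

        # BMP code units minus the 2048 surrogates, plus the code points
        # reachable through surrogate pairs (1024 high * 1024 low = 16 * 2**16).
        bmp = 2**16 - 2048
        supplementary = 1024 * 1024            # = 16 * 2**16
        assert bmp + supplementary == 1112064

        # Cross-check against the defined range U+0000..U+10FFFF with the
        # surrogate block U+D800..U+DFFF carved out.
        assert (0x10FFFF + 1) - (0xDFFF - 0xD800 + 1) == 1112064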

In an ideal future (read: fantasy), UTF-16 gets formally deprecated and trashed, freeing the surrogate range and the full encoding range for UTF-8.

Or UTF-16 is officially demoted to second-class citizen, and some code points are simply out of its reach.
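For what it's worth, today's tooling enforces exactly the restriction that fantasy would lift. A quick Python check shows both the surrogate block and everything past U+10FFFF being refused (the printed messages are just whatever the interpreter reports):

    try:
        chr(0xD800).encode("utf-8")        # a lone surrogate
    except UnicodeEncodeError as e:
        print("surrogate rejected:", e.reason)

    try:
        chr(0x110000)                      # first code point past U+10FFFF
    except ValueError as e:
        print("beyond Unicode:", e)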