Comment by GuB-42

1 day ago

Is 21 bits really a sacrifice? It is 2 million codepoints, and we currently use about a tenth of that.

Even with all Chinese characters de-unified, all the notable historical and constructed scripts, technical symbols, and all the submitted emoji, including rejections, you are still way short of a million.

We will probably never need more than 21 bits unless we start stretching the definition of what text is.

It's not 2 million, it's a little over 1 million.

The exact number is 1112064 = 2^16 - 2048 + 16*2^16: in UTF-16, 2 bytes can encode 2^16 - 2048 code points, and 4 bytes can encode 16*2^16 (the 2048 surrogates are not counted because they can never appear by themselves, they're used purely for UTF-16 encoding).
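
For anyone who wants to check the arithmetic, here is a quick Python sketch. The variable names are my own, chosen for illustration; the only facts assumed are that the surrogate range is U+D800..U+DFFF and that the highest code point is U+10FFFF.

```python
# Count the code points that UTF-16 can encode, excluding the surrogates.
BMP_SIZE = 2**16                     # U+0000..U+FFFF, reachable with one 16-bit unit
SURROGATES = 0xDFFF - 0xD800 + 1     # 2048 values reserved for surrogate pairs
SUPPLEMENTARY = 16 * 2**16           # planes 1..16, reachable with a surrogate pair

scalar_values = BMP_SIZE - SURROGATES + SUPPLEMENTARY

# Cross-check by brute force over the whole code space.
MAX_CODE_POINT = 0x10FFFF
brute_force = sum(
    1 for cp in range(MAX_CODE_POINT + 1) if not 0xD800 <= cp <= 0xDFFF
)

# Number of bits needed to hold the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

print(scalar_values, brute_force, bits_needed)
```

So 21 bits is what you need to store the largest code point, but the range stops at U+10FFFF, well short of the 2^21 values that 21 bits could hold.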

  • Even with just 1 million codepoints, why did they feel the need for CJK unification? Was it so it would all fit in UCS-2 or something?

    • Yes, that was exactly the reason. CJK unification happened during the few years when we were all trying to convince ourselves that 16 bits would be enough. By the time we acknowledged otherwise, it was too late.