Comment by GuB-42
4 hours ago
Is 21 bits really a sacrifice? It is 2 million codepoints, and we currently use about a tenth of that.
Even with all Chinese characters de-unified, all the notable historical and constructed scripts, technical symbols, and all the submitted emoji, including rejections, you are still way short of a million.
We will probably never need more than 21 bits unless we start stretching the definition of what text is.
It's not 2 million, it's a little over 1 million.
The exact number is 1112064 = 2^16 - 2048 + 16*2^16: in UTF-16, 2 bytes can encode 2^16 - 2048 code points, and 4 bytes can encode 16*2^16. The 2048 surrogates are not counted because they can never appear by themselves; they're used purely for UTF-16 encoding.
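For anyone who wants to check the arithmetic, here is a quick Python sketch. The variable names are mine, and it assumes the usual surrogate range U+D800 to U+DFFF and the usual upper limit U+10FFFF; it only re-derives the numbers above, nothing more.

```python
# Sanity check of the count above. All names here are illustrative.
BMP_SIZE = 2**16                     # code points U+0000..U+FFFF
SURROGATES = 0xDFFF - 0xD800 + 1     # U+D800..U+DFFF, reserved for UTF-16 pairs
SUPPLEMENTARY = 16 * 2**16           # 16 planes above the BMP

two_byte = BMP_SIZE - SURROGATES     # encodable as one 16-bit unit
four_byte = SUPPLEMENTARY            # encodable as a surrogate pair
scalar_values = two_byte + four_byte

max_code_point = 0x10FFFF
total_code_points = max_code_point + 1   # includes the surrogates

print(SURROGATES)                    # 2048
print(scalar_values)                 # 1112064
print(max_code_point.bit_length())   # 21, so 21 bits are enough
print(2**21)                         # 2097152, the "2 million" a raw 21-bit field could hold

# One BMP character takes 2 bytes in UTF-16, one supplementary character takes 4.
bmp_len = len("A".encode("utf-16-le"))
supp_len = len("\U0001F600".encode("utf-16-le"))
print(bmp_len, supp_len)             # 2 4
```

So both numbers show up: a 21-bit field has room for about 2 million values, but only a little over 1 million of them are valid.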