Comment by Mikhail_Edoshin

5 months ago

I once saw a good byte encoding for Unicode: 7 bits per byte for data, 1 for continuation/stop. This gives 21 bits for data, which is enough for the whole range. ASCII-compatible, at most 3 bytes per character. Very simple: the description alone is sufficient to implement it.
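
A minimal sketch of what that could look like in Python, assuming the high bit marks continuation bytes (so the final byte has it clear, which is what keeps ASCII intact) and the most-significant 7-bit group comes first; the description above doesn't pin down either detail, so treat this as one plausible reading rather than a spec:

    def encode_codepoint(cp: int) -> bytes:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("code point out of Unicode range")
        groups = []
        while True:
            groups.append(cp & 0x7F)   # peel off 7 bits at a time
            cp >>= 7
            if cp == 0:
                break
        groups.reverse()               # most-significant group first (assumed order)
        cont = bytes(0x80 | g for g in groups[:-1])  # continuation bytes: high bit set
        return cont + bytes([groups[-1]])            # final byte: high bit clear

    def decode_codepoint(data: bytes) -> int:
        cp = 0
        for b in data:
            cp = (cp << 7) | (b & 0x7F)
            if not b & 0x80:           # high bit clear marks the last byte
                break
        return cp

    # ASCII passes through unchanged; the top of the range fits in 3 bytes.
    assert encode_codepoint(ord("A")) == b"A"
    assert len(encode_codepoint(0x10FFFF)) == 3
    assert decode_codepoint(encode_codepoint(0x1F600)) == 0x1F600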

Probably a good idea, but when UTF-8 was designed, the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to.) And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.

  • Didn't they limit the range to 21 bits because UTF-16 has that limitation?

    • That is indeed why they limited it, but that was a mistake. I want to call UTF-16 a mistake all on its own, but since it predated UTF-8, I can't entirely do so. Limiting the Unicode range to only what UTF-16 can represent, though, was shortsighted. They should instead have allowed UTF-8 to keep addressing 31 bits, and if the standard ever grew past 21 bits, UTF-16 would simply have been deprecated. (Going into depth would take an essay, and at this point nobody cares about hearing it, so I'll refrain.)
