Comment by jasonwatkinspdx

10 hours ago

You are mistaken. Chinese Hanzi and the languages that derive from or incorporate them require way more than 65,536 code points. In particular, a lot of these characters are used in formal family or place names. UCS-2 failed because it couldn't represent these, and people using these languages justifiably objected to having to change how their family name is written to suit computers, rather than having computers handle it properly.

This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.

UTF-16 also had a bunch of unfortunate ramifications on the overall design of Unicode, e.g. requiring a substantial chunk of the BMP (the 2,048 code points U+D800 through U+DFFF) to be reserved for surrogates and forcing Unicode code points to be limited to U+10FFFF.
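
To make the arithmetic behind that concrete, here is a rough Python sketch of how a code point above U+FFFF gets split into a surrogate pair. The function name `to_surrogates` and the constants are my own, just for illustration; the example character U+20BB7 is a CJK Extension B ideograph, i.e. one that cannot fit in two bytes.

```python
# Illustrative sketch of UTF-16 surrogate-pair arithmetic.
# Names here (to_surrogates, HIGH_START, ...) are hypothetical, not a library API.
HIGH_START = 0xD800     # first high (lead) surrogate
LOW_START = 0xDC00      # first low (trail) surrogate
SURROGATE_END = 0xDFFF  # last low surrogate


def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000  # fits in 20 bits
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Code points in the BMP set aside for surrogates.
reserved = SURROGATE_END - HIGH_START + 1

# Two 10-bit halves give 2**20 values on top of the BMP, hence the ceiling.
max_code_point = 0x10000 + (1 << 20) - 1

high, low = to_surrogates(0x20BB7)
print(f"U+20BB7 -> {high:04X} {low:04X}")
print(f"reserved: {reserved}, max: U+{max_code_point:X}")
```

The pair carries only 20 bits of payload, which is exactly where the U+10FFFF limit comes from: 0x10000 + 2^20 - 1.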