Comment by Andrex

5 months ago

What are the perceived benefits of UTF-16 and UTF-32, and why did they come about?

I could ask Gemini but HN seems more knowledgeable.

UTF-16 is a hack that was invented when it became clear that UCS-2 wasn't gonna work (65536 codepoints was not enough for everybody).
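For anyone who hasn't looked at what the hack actually is: code points above U+FFFF get split across two 16-bit "surrogate" code units taken from the reserved range D800-DFFF. A rough Python sketch of that arithmetic follows. The function names are made up for illustration, not any library's API.

```python
def utf16_encode_code_point(cp):
    """Return the UTF-16 code units (as a list of ints) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        # Everything UCS-2 could already represent stays a single unit.
        return [cp]
    v = cp - 0x10000              # 20 bits remain after removing the offset
    high = 0xD800 | (v >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits go in the low surrogate
    return [high, low]


def utf16_decode_pair(high, low):
    """Inverse of the two-unit case: combine a surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


units = utf16_encode_code_point(0x1F600)   # U+1F600, an emoji outside the BMP
print([hex(u) for u in units])             # ['0xd83d', '0xde00']
```

The upshot is that UTF-16 is a variable-width encoding just like UTF-8, so it gives up the one thing UCS-2 had going for it.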

Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.

There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.

  • Besides Microsoft, plenty of others thought UTF-16 was a good idea. The Haskell Text type used to be based on UTF-16; it only switched to UTF-8 a few years ago. Java still uses UTF-16, but with an ad hoc optimization called Compact Strings that stores a string as ISO-8859-1 (one byte per character) whenever every character fits.

    • A lot of them did it because they had to have a Windows version and had to interface with Windows APIs and Windows programs that only spoke UTF-16 (or UCS-2 or some unspecified hybrid).

      Java's mistake seems to have been independent, motivated mainly by the mistaken idea that it was necessary to index directly into strings. That would have been deprecated fast if Windows had been UTF-8 friendly and very fast if it had been UTF-16 hostile.

      We can always dream.


  • Thank you! That's interesting.

    What about UTF-7? That seemed like a bad idea even at the time.