Comment by paulsutter

8 hours ago

Utf8 solved this completely. It works with any length unicode and on average takes up almost as little storage as ascii.

Utf16 is brain dead and an embarrassment

Blame the Unicode consortium for not coming up UTF-8 first (or, really, at all). And for assuming that 65526 code points would be enough for everyone.

So many problems could be solved with a time machine.

  • The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992, made an RFC in 1998. UTF-16 came along in 1996, made an RFC in 2000.

    The time machine would've involved Microsoft saying "it's clear now that USC-2 was a bad idea, so let's start migrating to something genuinely better".

    • I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 was well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.

      It's now 2026, everything always looks different in hindsight.

      1 reply →

It gets worse for UTF-16, Windows will let you name files using unpaired surrogates, now you have a filename that exists on your disk that cannot be represented in UTF-8 (nor compliant UTF-16 for that matter). Because of that, there's yet another encoding called WTF-8 that can represent the arbitrary invalid 16-bit values.