Comment by paulsutter

8 hours ago

Utf8 solved this completely. It works with any length unicode and on average takes up almost as little storage as ascii.

Utf16 is brain dead and an embarrassment

7 comments

paulsutter

wvenable 8 hours ago

Blame the Unicode consortium for not coming up UTF-8 first (or, really, at all). And for assuming that 65526 code points would be enough for everyone.

So many problems could be solved with a time machine.

kstrauser 7 hours ago
The first draft of Unicode was in 1988. Thompson and Pike came up with UTF-8 in 1992, made an RFC in 1998. UTF-16 came along in 1996, made an RFC in 2000.
The time machine would've involved Microsoft saying "it's clear now that USC-2 was a bad idea, so let's start migrating to something genuinely better".
- wvenable 3 hours ago
  
  I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 was well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.
  It's now 2026, everything always looks different in hindsight.
  
  1 reply →
- gpvos 4 hours ago
  
  MS could easily have added proper UTF-8 support in the early 2000s instead of the late 2010s.
  
  1 reply →

Dwedit 4 hours ago

It gets worse for UTF-16, Windows will let you name files using unpaired surrogates, now you have a filename that exists on your disk that cannot be represented in UTF-8 (nor compliant UTF-16 for that matter). Because of that, there's yet another encoding called WTF-8 that can represent the arbitrary invalid 16-bit values.