Comment by rmunn

4 hours ago

UTF-8 may still be a good choice for Japanese text, though.

For one thing, pure text is often not the only thing in the file. Markup is often present, and most markup syntaxes (such as HTML or XML) use characters from the ASCII range for the markup, so those characters are one byte in UTF-8 (but would be two bytes in UTF-16). Back when the UTF-8 Everywhere manifesto (https://utf8everywhere.org/) was being written, they took the Japanese-language Wikipedia article on Japan and compared the size of its HTML source in UTF-8 and UTF-16. (Scroll down to section 6 to see the results I'm about to cite.) UTF-8 was 767 KB, UTF-16 was 1186 KB, a bit more than 50% larger than UTF-8. The space savings on the HTML markup outweighed the extra bytes from the less efficient encoding of the Japanese text.

Then they copied and pasted just the Japanese text into a text file, to give UTF-16 the biggest win. There, the UTF-8 text was 222 KB while the UTF-16 encoding got it down to 176 KB, a 21% win for UTF-16, but not the 33% win you would have expected from a naive comparison (three bytes per character shrinking to two), because Japanese text still uses many characters from the ASCII set (space, punctuation...) and so there are still some single-byte UTF-8 characters in there. And once the files were compressed, both UTF-8 and UTF-16 were nearly the same size (83 KB vs 76 KB), which means there's little efficiency gain anyway if your content is being served over a gzipped connection.

So in theory, UTF-8 could be up to 50% larger than UTF-16 for Japanese, Chinese, or Korean text (or any of the other languages that fit into the higher part of the Basic Multilingual Plane), which is the same as saying UTF-16 could be up to 33% smaller. But in practice, even giving the UTF-16 text every possible advantage, they only saw about a 20% improvement over UTF-8.
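If you want to get a feel for this effect yourself, here's a minimal Python sketch. To be clear, the two sample strings are ones I made up for illustration (they are not the Wikipedia article the manifesto measured), and `sizes` is just a hypothetical helper name, so the exact numbers you get will differ from the ones cited above.

```python
import gzip

# Made-up samples for illustration only, NOT the manifesto's test data.
# "plain" is almost entirely Japanese; "marked_up" mixes in ASCII markup.
plain = "日本語のテキストは、UTF-8では一文字あたり三バイトになります。" * 200
marked_up = '<p class="lead"><a href="/wiki/Japan">日本</a>は東アジアの国です。</p>\n' * 200


def sizes(text):
    """Return {encoding: (raw_bytes, gzipped_bytes)} for UTF-8 and UTF-16."""
    result = {}
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would prepend.
    for enc in ("utf-8", "utf-16-le"):
        raw = text.encode(enc)
        result[enc] = (len(raw), len(gzip.compress(raw)))
    return result


for label, text in (("plain", plain), ("markup", marked_up)):
    for enc, (raw, packed) in sizes(text).items():
        print(f"{label:7} {enc:10} raw={raw:7} gzip={packed:6}")
```

With these samples, UTF-16 comes out smaller for the nearly pure Japanese string and UTF-8 comes out smaller once the ASCII markup is mixed in. How close the gzipped sizes end up depends heavily on the input, so treat this as a way to experiment rather than a benchmark.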

Which is not nearly enough to justify all the extra cost of suddenly not knowing what encoding your text file is in any more, not when we've finally reached the point of being able to open a text file and just know the encoding.

P.S. I didn't even mention the Shift JIS encoding, and there's a reason I didn't. I've never had to use it "for real", but I've read about it. No. No thank you. No. Shudder. I'm not knocking the cleverness of it, it was entirely necessary back when all you had was 8 bits to work with. But let me put it this way: it's not a coincidence that Japan invented a word (mojibake) to represent what happens when you see text interpreted in the wrong encoding. There were multiple variations of Shift JIS (and there was also EUC-JP just to throw extra confusion into the works), so Japanese people saw garbled text all the time as it moved from one computer running Windows, to an email server likely running Unix, to another computer running Windows... it was a big mess. It's also not a coincidence that (according to Wikipedia), 99.1% of Japanese websites (defined as "in the .jp domain") are encoded in UTF-8, while Shift JIS is used by only 1% (probably about 0.95% rounded up) of .jp websites.
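For anyone who hasn't seen mojibake first-hand, here's a small Python sketch of how it comes about. The sample word and variable names are my own choices for illustration; the codec names (`shift_jis`, `euc_jp`) are the ones from Python's standard library.

```python
text = "文字化け"  # "mojibake", used here as an arbitrary sample word

# The sender writes the text as Shift JIS bytes.
sjis_bytes = text.encode("shift_jis")

# A receiver that guesses the right encoding gets the text back intact.
right = sjis_bytes.decode("shift_jis")

# A receiver that guesses EUC-JP gets garbage. errors="replace" substitutes
# U+FFFD for invalid sequences instead of raising an exception.
wrong = sjis_bytes.decode("euc_jp", errors="replace")

# Strict UTF-8 decoding rejects these particular bytes outright, which is
# one reason a wrong guess of "UTF-8" tends to be noticed quickly.
try:
    sjis_bytes.decode("utf-8")
    utf8_accepted = True
except UnicodeDecodeError:
    utf8_accepted = False

print("correct guess:", right)
print("wrong guess:  ", wrong)
print("valid UTF-8?  ", utf8_accepted)
```

The bytes never change here. Only the receiver's guess about the encoding does, and that guess is exactly what used to go wrong as text moved between machines.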

So in practice, nearly everyone in Japan would rather have slightly less efficient encoding of text, but know for a fact that their text will be read correctly on the other end.
