← Back to context

Comment by thaumasiotes

5 hours ago

UTF-8 encodes European glyphs in two bytes and oriental glyphs in three bytes. This is due to the assumption that you're not going to be using oriental glyphs. If you are going to use them, UTF-8 is a very poor choice.

UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply.

Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?

  • Unicode could have just been encoded statefuly with a "current code page" mark byte.

    With UTF and emojis we can't have random access to characters anyways, so why not go the whole way?

    • Yikes. That would lose the ability to know the meaning of the current bytes, or misinterpret them badly, if you happen to get one critical byte dropped or mangled in transmission. At least UTF-8 is self-syncing: if you end up starting to read in the middle of a non-rewindable stream whose beginning has already passed, you can identify the start of the next valid codepoint sequence unambiguously, and then end up being able to sync up with the stream, and you're guaranteed not to have to read more than 4 bytes (6 bytes when UTF-8 was originally designed) in order to find a sync point.

      But if you have to rely on a byte that may have already gone past? No way to pick up in the middle of a stream and know what went before.

      1 reply →

    • A huge, central, part of UTF-8 design is that you can start decoding it from any arbitrary offset, it is self-aligning.

  • > Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information?

    Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark .

    A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.

    > Most European languages use variations of the latin alphabet

    If you want to interpret "variations of Latin" really, really loosely, that's true.

    Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

    • As someone who has been using Cyrillic writing all my life, I've never noticed this bloat you're speaking of, honestly...

      Maybe if you're one of those AI behemots who works with exabytes of training data, it would make some sense to compress it down by less than 50% (since we're using lots of Latin terms and acronyms and punctuation marks which all fit in one byte in UTF-8).

      On the web and in other kinds of daily text processing, one poorly compressed image or one JavaScript-heavy webshite obliterates all "savings" you would have had in that week by encoding text in something more efficient.

      It's the same with databases. I've never seen anyone pick anything other than UTF-8 in the last 10 years at least, even though 99% of what we store there is in Cyrillic. I sometimes run into old databases, which are usually Oracle, that were set up in the 90s and never really upgraded. The data is in some weird encoding that you haven't heard of for decades, and it's always a pain to integrate with them.

      I remember the days of codepages. Seeing broken text was the norm. Technically advanced users would quickly learn to guess the correct text encoding by the shapes of glyphs we would see when opening a file. Do not want.

    • UTF-8 does not require a byte order mark. The byte order mark is a technical necessity born from UTF-16 and a desire to store UTF-16 in a machine's native endianness.

      The byte order mark has has no relation to code pages.

      I don't think you know what you're talking about and I do not think further engagement with you is fruitful. Bye.

      EDIT: okay since you edited your comment to add the part about Greek and cryllic after I responded, I'll respond to that too. Notice how I did not say "all European languages". Norwegian, Swedish, French, Danish, Spanish, German, English, Polish, Italian, and many other European languages have writing systems where typical texts are "mostly ASCII with a few special symbols and diacritics here and there". Yes, Greek and cryllic are exceptions. That does not invalidate my point.

UTF-8 may still be a good choice for Japanese text, though.

For one thing, pure text is often not the only thing in the file. Markup is often present, and most markup syntaxes (such as HTML or XML) use characters from the ASCII range for the markup, so those characters are one byte (but would be two bytes in UTF-16). Back when the UTF-8 Everywhere manifesto (https://utf8everywhere.org/) was being written, they took the Japanese-language Wikipedia article on Japan, and compared the size of its HTML source between UTF-8 and UTF-16. (Scroll down to section 6 to see the results I'm about to cite). UTF-8 was 767 KB, UTF-16 was 1186 KB, a bit more than 50% larger than UTF-8. The space savings from the HTML markup outweighed the extra bytes from having a less-efficient encoding of Japanese text. Then they did a copy-and-paste of just the Japanese text into a text file, to give UTF-16 the biggest win. There, the UTF-8 text was 222 KB while the UTF-16 encoding got it down to 176 KB, a 21% win for UTF-16 — but not the 50% win you would have expected from a naive comparison, because Japanese text still uses many characters from the ASCII set (space, punctuation...) and so there are still some single-byte UTF-8 characters in there. And once the files were compressed, both UTF-8 and UTF-16 were nearly the same size (83 KB vs 76 KB) which means there's little efficiency gain anyway if your content is being served over a gzip'ed connection.

So in theory, UTF-8 could be up to 50% larger than UTF-16 for Japanese, Chinese, or Korean text (or any of the other languages that fit into the higher part of the basic multilingual place). But in practice, even giving the UTF-16 text every possible advantage, they only saw a 20% improvement over UTF-8.

Which is not nearly enough to justify all the extra cost of suddenly not knowing what encoding your text file is in any more, not when we've finally reached the point of being able to open a text file and just know the encoding.

P.S. I didn't even mention the Shift JIS encoding, and there's a reason I didn't. I've never had to use it "for real", but I've read about it. No. No thank you. No. Shudder. I'm not knocking the cleverness of it, it was entirely necessary back when all you had was 8 bits to work with. But let me put it this way: it's not a coincidence that Japan invented a word (mojibake) to represent what happens when you see text interpreted in the wrong encoding. There were multiple variations of Shift JIS (and there was also EUC-JP just to throw extra confusion into the works), so Japanese people saw garbled text all the time as it moved from one computer running Windows, to an email server likely running Unix, to another computer running Windows... it was a big mess. It's also not a coincidence that (according to Wikipedia), 99.1% of Japanese websites (defined as "in the .jp domain") are encoded in UTF-8, while Shift JIS is used by only 1% (probably about 0.95% rounded up) of .jp websites.

So in practice, nearly everyone in Japan would rather have slightly less efficient encoding of text, but know for a fact that their text will be read correctly on the other end.