Comment by gucci-on-fleek

17 hours ago

> utf-8 encodes code points in one to four bytes, it is byte oriented vs utf-16 etc. In zig u8 is a byte, and is also (by convention) a char, although there isn't an explicit char type in zig. […]

> 24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean. 16 bits (2 bytes) gets you Latin letters with diacritics, Greek, and Arabic scripts. With 8 bits (1 byte) getting you Standard ASCII etc...

Ah ok, so if I understand you correctly, you're taking a variable-length encoding (UTF-8) and limiting and/or padding it to 3 octets (24 bits)? In that case, what you said in your original post makes sense, but I'm not really sure why you'd ever want to encode something this way: you have to deal with the complexities of a variable-length encoding to parse each u24, you have the poor space usage of a fixed-length encoding, and you're using 24 bits to encode only 0x10000 code points (U+0000 through U+FFFF), even though you can fit all of Unicode in only 21 bits.
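To make the numbers concrete, here's a quick Python sketch (the helper name `utf8_len` is just something I made up for illustration) that checks how many bytes UTF-8 uses at each length boundary, and how many bits the highest code point needs:

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    return len(chr(code_point).encode("utf-8"))

# Boundaries of each UTF-8 length class.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")

# The highest Unicode code point, U+10FFFF, fits in 21 bits.
print((0x10FFFF).bit_length())
```

Everything up to U+FFFF fits in 3 bytes or fewer, and everything from U+10000 up needs 4.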

> Technically there are chars in languages that need all 4 bytes in utf-8, but almost all of them are historical or emoji's in utf-8.

Yes, the majority of the characters in the non-BMP planes are for archaic languages, but that's not really the right way to look at it, since most languages only need fewer than 100 characters, and there are more dead languages than living ones. Instead, I'd look at it from the other direction: how many living languages need non-BMP characters? This sibling comment [0] gives one example, but there are lots more [1] [2] [3] [4] [5] [6].

Now, it's fine to not support these characters, but the argument in that case should be that you've decided that the characters aren't important enough to outweigh the technical challenges, not that nobody needs the characters.

> 24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean.

It gets you a subset of CJK that's probably sufficient for many purposes, but there are nearly 75k CJK characters outside of the BMP.
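As a rough illustration (the function name `non_bmp_chars` is hypothetical, and the sample string is just one I picked), here's a Python sketch that lists the characters in a string that a 3-byte limit would drop. U+20BB7 is a CJK Extension B character that shows up in modern Japanese names:

```python
def non_bmp_chars(text: str) -> list[str]:
    """Return the characters above U+FFFF, i.e. those needing 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

# U+20BB7 followed by two ordinary BMP kanji.
name = "\U00020BB7野家"

for ch in non_bmp_chars(name):
    print(f"U+{ord(ch):04X} needs {len(ch.encode('utf-8'))} bytes in UTF-8")
```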

> There is a point you could make that it may have been better to use utf-16 etc... and that we should have dropped ascii/latin-1 support, but once again go up to the 'Basic Multilingual Plane' in your [3] and notice that is covered by 24bits (3 bytes) in utf-8 encoding.

If you are willing and able to use a 24-bit encoding, then I'd argue that you should just use UCS-3/UTF-24, since those allow you to encode every Unicode character. The only downside is that these encodings aren't formally defined, so other programs won't understand them, but if that's an issue you can use UCS-4/UTF-32.
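Since there's no formal definition, any implementation is a guess, but a minimal Python sketch of what I mean could look like this. I'm assuming big-endian byte order and one code point per 3 bytes; `encode_utf24` and `decode_utf24` are made-up names, not anything from a real library:

```python
def encode_utf24(text: str) -> bytes:
    """Pack each code point into exactly 3 big-endian bytes (assumed layout)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)

def decode_utf24(data: bytes) -> str:
    """Inverse of encode_utf24; rejects input that isn't a multiple of 3 bytes."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3 bytes")
    # chr() raises ValueError for values above U+10FFFF.
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )

sample = "a\u00e9\u4e2d\U00020000"  # 1-, 2-, 3-, and 4-byte characters in UTF-8
packed = encode_utf24(sample)
print(len(packed), decode_utf24(packed) == sample)
```

Decoding is just indexing in steps of 3, with no continuation-byte logic, and nothing in Unicode is out of reach.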

[0]: https://en.wikipedia.org/wiki/Ethiopic_Extended-B