Comment by gucci-on-fleek

21 hours ago

> The 24 Bits (3 Bytes) [3]u8 to u24 example is exactly related to utf-8 that covers all the languages but excludes the emojis.

I'm not familiar with Zig, so maybe it's doing something weird here, but that doesn't really make sense with Unicode in general.

First, the largest Unicode codepoint that will ever be allocated is U+10FFFF [0], which is less than 2^21, so all Unicode characters will fit in a 24-bit integer. Perhaps you're thinking of UCS-2 or UTF-16 without surrogates, which are both 16 bits wide and are limited to the BMP [1] [2] (and therefore don't include most emojis).
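
The bit widths above are easy to check. A minimal Python sketch (Python is only a convenient calculator here, and the variable names are mine):

```python
# Largest code point Unicode will ever allocate, and the last BMP code point.
MAX_CODE_POINT = 0x10FFFF
MAX_BMP = 0xFFFF

# Number of bits needed to represent each limit as an unsigned integer.
bits_for_all_unicode = MAX_CODE_POINT.bit_length()
bits_for_bmp = MAX_BMP.bit_length()

# GRINNING FACE, a common emoji, lies outside the BMP.
grinning_face = ord("\U0001F600")

print(bits_for_all_unicode, bits_for_bmp, grinning_face > MAX_BMP)
```

So a 24-bit integer has room to spare for any code point, while a 16-bit one cannot hold this emoji without surrogates.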

Second, while the characters needed for most languages lie within the BMP, not all of them do [3], so it isn't really possible to support all languages while excluding emoji, aside from using the Unicode character database to exclude certain categories [4] [5].
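
As a rough sketch of the category-based approach, the following uses Python's `unicodedata` module to drop characters whose General_Category is `So` (Symbol, other). Treating `So` as "emoji" is my simplifying assumption: it also removes non-emoji symbols, and some emoji live in other categories.

```python
import unicodedata

def drop_other_symbols(text: str) -> str:
    """Remove characters whose General_Category is 'So' (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

adlam_alif = "\U0001E900"     # ADLAM CAPITAL LETTER ALIF, outside the BMP
grinning_face = "\U0001F600"  # GRINNING FACE, also outside the BMP

# The letter survives the filter; the emoji does not.
filtered = drop_other_symbols("a" + adlam_alif + grinning_face)
```

The filter keys on what a character is rather than how many bytes it occupies, so non-BMP letters are kept.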

[0]: https://www.unicode.org/faq/utf_bom.html#gen0

[1]: https://www.unicode.org/faq/utf_bom.html#utf16-11

[2]: https://en.wikipedia.org/wiki/Universal_Coded_Character_Set

[3]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_...

[4]: https://www.unicode.org/reports/tr44/tr44-34.html#General_Ca...

[5]: https://en.wikipedia.org/wiki/Unicode_character_property#Gen...

nyrikki 19 hours ago

Note the utf-8 [0] in my response; the answers are on the pages you linked, but not in the sections you linked.

utf-8 encodes code points in one to four bytes; it is byte-oriented, unlike utf-16 etc. In Zig, u8 is a byte and is also (by convention) a char, although there isn't an explicit char type in Zig. Technically there are chars in languages that need all 4 bytes in utf-8, but almost all of them are historical or emoji.

24 bits (3 bytes) in utf-8 gets you Chinese, Japanese, and Korean. 16 bits (2 bytes) gets you Latin letters with diacritics, Greek, and Arabic scripts, and 8 bits (1 byte) gets you standard ASCII.

There is a point you could make that it may have been better to use utf-16 etc. and that we should have dropped ascii/latin-1 support, but once again go up to the 'Basic Multilingual Plane' in your [3] and notice that it is covered by 24 bits (3 bytes) in utf-8 encoding.
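
To make the byte counts concrete, here is a small Python sketch that encodes one sample character per script to UTF-8 and records the length. The sample characters are my own picks, not from the thread.

```python
# One sample character per script, written as escapes to avoid normalization surprises.
samples = {
    "A": "ASCII",
    "\u00e9": "Latin letter with diacritic",
    "\u03bb": "Greek",
    "\u0628": "Arabic",
    "\u4e2d": "CJK ideograph in the BMP",
    "\U0001F600": "emoji outside the BMP",
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_lengths[ch]} byte(s)")
```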

[0] https://en.wikipedia.org/wiki/UTF-8

rmunn 19 hours ago
> ... but almost all of them are historical or emoji's in utf-8.
I just posted a comment, five minutes after you wrote that, which I won't repeat here since it was quite long. But one of the languages whose alphabet is found in the Supplementary Multilingual Plane is Fulani, spoken natively by 37 million people (plus another two and a half million who have learned it as a second language). While it can be written in other alphabets (both Latin and Arabic have been used to write it in the past, for example), those alphabets don't usually represent all the sounds of the language properly, making it awkward. There's a reason why the Adlam script was invented to write Fulani, and that invention was recent enough that it was assigned the U+1E900 to U+1E95F block, since the Basic Multilingual Plane was full by then.
So although it's easy to think that the astral planes are only used for emoji and historical languages, that's not actually true. There are languages spoken by millions of people in those astral planes as well (yes, languages plural; Fulani isn't the only one, it's just the largest).
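
A quick way to see where Adlam sits: the plane number is the code point shifted right by 16 bits. A minimal Python sketch (the `plane` helper is hypothetical, written just for this check):

```python
def plane(ch: str) -> int:
    """Return the Unicode plane number of a single character (0 is the BMP)."""
    return ord(ch) >> 16

adlam_alif = "\U0001E900"  # first code point of the Adlam block

adlam_plane = plane(adlam_alif)
adlam_utf8_len = len(adlam_alif.encode("utf-8"))

print(adlam_plane, adlam_utf8_len)
```

Any plane other than 0 means the character needs four bytes in UTF-8, so a 3-byte limit would drop Adlam text.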
- nyrikki 17 hours ago
  
  To be clear, I was talking about a use case, not all use cases.
  There are very real times where you have to support all 4 bytes, and there are others where other constraints require you to restrict the domain of discourse.
  It doesn't change the value/cost of bit casting in a language with arbitrary bit-width integers, especially when combined with the fact that int overflows are detectable illegal behaviour and you have saturating and wrapping operators.
  This is in addition to the ease of using packed structs I mentioned above.
  A list of some advantages:
  * Zig's arbitrary-sized integers have a fully defined ABI for padding
  * Allows for strict domain modeling using them as platform independent refinement types
  * Precise memory packing, allowing more utilization of register space etc...
  * OOB compile time checks
  * Bit masking optimization, where sequential changes to packed values are often merged into a small number of and/or masks
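
For readers who don't know Zig, the three overflow behaviours mentioned above (Zig spells them `+`, `+%` and `+|`) can be sketched in Python for an unsigned integer of any width. The function names are mine, and Python checks at run time what Zig's safety-checked builds detect for you.

```python
def checked_add(a: int, b: int, bits: int) -> int:
    """Add two unsigned `bits`-wide values; raise if the result does not fit."""
    result = a + b
    if result >> bits:
        raise OverflowError(f"{a} + {b} does not fit in u{bits}")
    return result

def wrapping_add(a: int, b: int, bits: int) -> int:
    """Add modulo 2**bits, discarding any carry out of the top bit."""
    return (a + b) & ((1 << bits) - 1)

def saturating_add(a: int, b: int, bits: int) -> int:
    """Add, clamping the result to the largest `bits`-wide value."""
    return min(a + b, (1 << bits) - 1)

# With a 12-bit type, 4095 is the maximum value.
wrapped = wrapping_add(4095, 1, 12)
saturated = saturating_add(4095, 1, 12)
```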
  To move to a more information theory example:
  DNA nucleotides (A, C, G, T) form a four-state (quaternary) alphabet.
  If you wanted to store an array of 1,000 DNA nucleotides, each symbol is one of 4 bases, requiring exactly 2 bits of information. The Shannon information would be: 1000 * 2 bits = 2000 bits.
  With uint8_t this would take 8k bits, vs 2k bits with u2. That is 300% more for uint8_t.
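
The same arithmetic as a Python sketch that packs four nucleotides per byte. The 2-bit code assignment and the `pack`/`unpack` helpers are assumptions of mine, not anything Zig-specific.

```python
BASES = "ACGT"
CODES = {base: code for code, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3

def pack(seq: str) -> bytes:
    """Pack a nucleotide string at 2 bits per symbol, four symbols per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)

def unpack(data: bytes, count: int) -> str:
    """Recover `count` nucleotides from bytes produced by pack()."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)
    )

seq = "ACGT" * 250            # 1,000 nucleotides
packed = pack(seq)
packed_bits = len(packed) * 8
byte_per_symbol_bits = len(seq) * 8
```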
  It is still horses for courses, but as an example consider a 12-bit sensor reading stored in a standard u16: the data type allows invalid states. Ensuring safety then requires manual defensive logic throughout your program in the traditional C/Rust/... model.
  That traditional model in zig:
      fn processSensor(value: u16) !void {
          if (value > 4095) return error.InvalidSensorData; // Extra logic branch
          // ... logic ...
      }
  And the lower overall Kolmogorov complexity (cherry picked) form:
      fn processSensor(value: u12) void {
          // Zero validation boilerplate code required here
      }
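
For comparison, the closest analogue I can sketch in a language without arbitrary-width integers is a wrapper type that validates once at construction. The `U12` class and `process_sensor` function are hypothetical, and unlike Zig's `u12` the check here happens at run time rather than in the type system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class U12:
    """Hypothetical stand-in for a 12-bit unsigned integer (0..4095)."""
    value: int

    def __post_init__(self) -> None:
        # The only range check in the program lives at this boundary.
        if not 0 <= self.value <= 0xFFF:
            raise ValueError(f"{self.value} is not a 12-bit value")

def process_sensor(reading: U12) -> float:
    """Scale a reading to the range 0.0..1.0; no range check is needed here."""
    return reading.value / 0xFFF

full_scale = process_sensor(U12(4095))
```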
  C23 does have _BitInt types for structs, which can help if bit packing is your primary need, but IMHO it doesn't offer the same advantages.
  As an example, and I may be wrong, but I think you can't easily perform checked arithmetic or use standard overflow operations on individual C bit-fields without copying them out into standard types (like int), modifying them, masking them, and copying them back.
  With Zig the invariant is maintained implicitly at the type layer, removing runtime validation branches, error paths, and testing code.
  Does it solve all problems? No. Is @bitCast, a zero-runtime-overhead, compile-time-checked bit reinterpretation, and [3]u8 → u24 useless and silly? No.
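
To illustrate what reinterpreting three bytes as one 24-bit integer yields, here is a Python sketch using `int.from_bytes`. Python has no bit cast, so this only shows the resulting values. Which byte order a real bit cast corresponds to depends on the target, so both are shown.

```python
raw = bytes([0x01, 0x02, 0x03])  # the three bytes of a [3]u8-style array

# The same 24 bits read as one unsigned integer, in each byte order.
as_u24_little = int.from_bytes(raw, "little")
as_u24_big = int.from_bytes(raw, "big")

# Converting back with the same byte order recovers the original bytes.
round_trip = as_u24_little.to_bytes(3, "little")

print(hex(as_u24_little), hex(as_u24_big))
```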
  
gucci-on-fleek 17 hours ago

> utf-8 encodes code points in one to four bytes, it is byte oriented vs utf-16 etc. In zig u8 is a byte, and is also (by convention) a char, although there isn't an explicit char type in zig. […]
> 24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean. 16 bits (2 bytes) gets you Latin letters with diacritics, Greek, and Arabic scripts. With 8 bits (1 byte) getting you Standard ASCII etc...
Ah ok, so if I understand you correctly, you're taking a variable-length encoding (UTF-8), and limiting and/or padding it to 3 octets (24 bits)? In that case, what you said in your original post makes sense, but I'm not really sure why you'd ever want to encode something this way: you have to deal with the complexities of a variable-length encoding to parse each u24, you have the poor space usage of a fixed-length encoding, and you're using 24 bits to encode only 0xFFFF characters (even though you can fit all of Unicode in only 21 bits).
> Technically there are chars in languages that need all 4 bytes in utf-8, but almost all of them are historical or emoji's in utf-8.
Yes, the majority of the characters in the non-BMP planes are for archaic languages, but that's not really the right way to look at it, since most languages only need <100 characters, and there are more dead languages than living ones. Instead, I'd look at it from the reverse lens of how many living languages need non-BMP characters. This sibling comment [0] gives one example, but there are lots more [1] [2] [3] [4] [5] [6].
Now, it's fine to not support these characters, but the argument in that case should be that you've decided that the characters aren't important enough to outweigh the technical challenges, not that nobody needs the characters.
> 24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean.
It gets you a subset of CJK that's probably sufficient for many purposes, but there are nearly 75k CJK characters outside of the BMP.
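
One way to check that figure yourself is to count the supplementary-plane code points named "CJK UNIFIED IDEOGRAPH-..." in Python's `unicodedata` database. This is a rough sketch: the count depends on the Unicode version your Python ships (see `unicodedata.unidata_version`), and it skips compatibility ideographs.

```python
import unicodedata

def count_supplementary_cjk() -> int:
    """Count CJK unified ideographs above U+FFFF known to this Python build."""
    count = 0
    for code_point in range(0x10000, 0x110000):
        # Unassigned code points have no name; the default "" skips them.
        name = unicodedata.name(chr(code_point), "")
        if name.startswith("CJK UNIFIED IDEOGRAPH-"):
            count += 1
    return count

supplementary_cjk = count_supplementary_cjk()
first_extension_b = "\U00020000"  # first ideograph of CJK Extension B
extension_b_utf8_len = len(first_extension_b.encode("utf-8"))

print(unicodedata.unidata_version, supplementary_cjk)
```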
> There is a point you could make that it may have been better to use utf-16 etc... and that we should have dropped ascii/latin-1 support, but once again go up to the 'Basic Multilingual Plane' in your [3] and notice that is covered by 24bits (3 bytes) in utf-8 encoding.
If you are willing and able to use a 24-bit encoding, then I'd argue that you should just use UCS-3/UTF-24, since those allow you to encode every Unicode character. The only downside is that these encodings aren't formally defined, so other programs won't understand them, but if that's an issue you can use UCS-4/UTF-32.
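
A minimal sketch of what such a fixed-width 24-bit encoding could look like, in Python. Since no such encoding is formally defined, the function names and the big-endian byte order are purely my assumptions.

```python
def encode_fixed24(text: str) -> bytes:
    """Encode each code point as exactly three big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)

def decode_fixed24(data: bytes) -> str:
    """Decode bytes produced by encode_fixed24()."""
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3 bytes")
    # chr() rejects values above U+10FFFF, so out-of-range input raises.
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big")) for i in range(0, len(data), 3)
    )

# ASCII, a BMP ideograph, an emoji, and an Adlam letter all take 3 bytes each.
text = "a\u4e2d\U0001F600\U0001E900"
encoded = encode_fixed24(text)
decoded = decode_fixed24(encoded)
```

Every character round-trips, including the non-BMP ones, at the cost of three bytes even for ASCII.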
[0]: https://en.wikipedia.org/wiki/Ethiopic_Extended-B