Comment by rmunn
19 hours ago
> ... but almost all of them are historical or emoji's in utf-8.
I just posted a comment, five minutes after you wrote that, which I won't repeat here since it was quite long. But one of the languages whose alphabet is found in the Supplementary Multilingual Plane is Fulani, spoken natively by 37 million people (plus another two and a half million who have learned it as a second language). While it can be written in other alphabets (both Latin and Arabic have been used to write it in the past, for example), other alphabets don't usually represent all the sounds of the language properly, making it awkward. There's a reason why the Adlam script was invented to write Fulani with; and that invention was recent enough that it was assigned the U+1E900 to U+1E95F block, since the Basic Multilingual Plane was full by then.
So although it's easy to think that the astral planes are only used for emoji and historical languages, that's not actually true. There are languages spoken by millions of people in those astral planes as well (yes, languages plural; Fulani isn't the only one, it's just the largest).
To be clear, I was talking about a use case, not all use cases.
There are very real times when you have to support the full 4-byte range of UTF-8, and there are others where other drivers require you to restrict the domain of discourse.
It doesn't change the value/cost of bit casting in a language with arbitrary bit-width integers, especially when combined with the fact that integer overflow is detectable illegal behaviour and you have saturating and wrapping operators.
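To make the overflow point concrete, here is a minimal sketch, assuming Zig 0.11 or later. It shows the wrapping (`+%`) and saturating (`+|`) operators on a `u12`, plus `std.math.add` as one way to surface overflow as an error instead of illegal behaviour. The test name and values are illustrative only.

```zig
const std = @import("std");

test "overflow handling on a u12" {
    const max: u12 = std.math.maxInt(u12); // 4095, the largest 12-bit value
    const one: u12 = 1;

    // Wrapping addition: 4095 + 1 wraps around to 0.
    try std.testing.expectEqual(@as(u12, 0), max +% one);

    // Saturating addition: the result clamps at the maximum.
    try std.testing.expectEqual(@as(u12, 4095), max +| one);

    // Checked addition: overflow is reported as error.Overflow.
    try std.testing.expectError(error.Overflow, std.math.add(u12, max, one));

    // A plain `max + one` on these comptime-known values would be rejected
    // by the compiler, so it is left out here.
}
```

Run it with `zig test` on the file containing the block.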
This is in addition to the ease of using packed structs I mentioned above.
A list of some advantages:
* Zig's arbitrary-sized integers have a fully defined ABI for padding
* Allows for strict domain modeling using them as platform independent refinement types
* Precise memory packing, allowing more utilization of register space etc...
* Compile-time out-of-bounds (OOB) checks
* Bit masking optimization, where sequential changes to packed values are often merged into a small number of and/or masks
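A small sketch of the packing and domain-modeling points, assuming Zig 0.11 or later. `Config` and its fields are a hypothetical register layout invented for this example, not something from the thread.

```zig
const std = @import("std");

// Hypothetical 16-bit register layout. Field widths sum to exactly 16 bits,
// and the compiler checks that against the declared backing integer.
const Config = packed struct(u16) {
    mode: u3, // occupies bits 0..2 (first field is least significant)
    channel: u5, // bits 3..7
    gain: u4, // bits 8..11
    enabled: bool, // bit 12
    reserved: u3 = 0, // bits 13..15
};

test "packed struct is exactly as wide as its backing integer" {
    try std.testing.expect(@bitSizeOf(Config) == 16);

    const cfg = Config{ .mode = 5, .channel = 17, .gain = 9, .enabled = true };

    // Reinterpret the struct as its backing integer.
    const raw: u16 = @bitCast(cfg);
    try std.testing.expectEqual(@as(u16, 5 | (17 << 3) | (9 << 8) | (1 << 12)), raw);

    // Round trip back to the struct and read a field.
    const back: Config = @bitCast(raw);
    try std.testing.expectEqual(@as(u5, 17), back.channel);

    // Writing `.mode = 8` above would not compile, because 8 does not fit in a u3.
}
```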
To move to a more information-theoretic example:
DNA nucleotides (A, C, G, T) form a quaternary (four-state) alphabet.
If you wanted to store an array of 1,000 DNA nucleotides, each symbol is one of 4 bases, requiring exactly 2 bits of information. The Shannon information would be: 1000 * 2 bits = 2000 bits.
With uint8_t this would take 8k bits, vs 2k bits with packed u2 values. That is 300% more for uint8_t.
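The arithmetic above can be sketched as follows, assuming Zig 0.11 or later. One caveat worth hedging: a plain `[1000]u2` array still spends a byte per element, so reaching 2 bits per base needs explicit packing. `Base`, `packedLen`, `setBase`, and `getBase` are hypothetical helpers written for this example.

```zig
const std = @import("std");

const Base = enum(u2) { A = 0, C = 1, G = 2, T = 3 };

/// Bytes needed to hold `n` bases at four bases per byte.
fn packedLen(n: usize) usize {
    return (n + 3) / 4;
}

/// Store base `b` at index `i`, two bits per base, lowest bits first.
fn setBase(buf: []u8, i: usize, b: Base) void {
    const shift: u3 = @intCast((i % 4) * 2);
    const mask: u8 = @as(u8, 0b11) << shift;
    buf[i / 4] = (buf[i / 4] & ~mask) | (@as(u8, @intFromEnum(b)) << shift);
}

/// Read the base stored at index `i`.
fn getBase(buf: []const u8, i: usize) Base {
    const shift: u3 = @intCast((i % 4) * 2);
    const bits: u2 = @truncate(buf[i / 4] >> shift);
    return @enumFromInt(bits);
}

test "1000 bases fit in 2000 bits when packed" {
    const n_bytes = comptime packedLen(1000);
    var buf = [_]u8{0} ** n_bytes;
    try std.testing.expectEqual(@as(usize, 2000), buf.len * 8);

    // An unpacked array of u2 uses one byte per element.
    try std.testing.expect(@sizeOf([1000]u2) == 1000);

    const pattern = [_]Base{ .G, .C, .A, .T };
    for (0..1000) |i| setBase(&buf, i, pattern[i % 4]);
    for (0..1000) |i| try std.testing.expectEqual(pattern[i % 4], getBase(&buf, i));

    // G, C, A, T packed lowest bits first: 2 | 1<<2 | 0<<4 | 3<<6 = 0xC6.
    try std.testing.expectEqual(@as(u8, 0xC6), buf[0]);
}
```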
It is still horses for courses, but as an example, consider a 12-bit sensor reading stored in a standard u16: the data type allows invalid states. Ensuring safety then requires manual defensive logic throughout your program in the traditional C/Rust/... model.
That traditional model in Zig:
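As an illustration only, and not the commenter's own snippet, here is a minimal sketch of what a u16-based model with manual validation might look like, assuming Zig 0.11 or later. `SensorError`, `max_reading`, `validate`, and `average` are hypothetical names.

```zig
const std = @import("std");

const SensorError = error{OutOfRange};

// Largest value a 12-bit reading can take.
const max_reading: u16 = 0x0FFF;

/// Reject any u16 that is not a valid 12-bit reading.
fn validate(raw: u16) SensorError!u16 {
    if (raw > max_reading) return SensorError.OutOfRange;
    return raw;
}

/// Every function that accepts a reading has to re-check it, because the
/// u16 type itself cannot rule out values above 4095.
fn average(a: u16, b: u16) SensorError!u16 {
    const x = try validate(a);
    const y = try validate(b);
    return (x + y) / 2; // at most 8190, so this cannot overflow a u16
}

test "u16 readings need runtime validation" {
    try std.testing.expectEqual(@as(u16, 4047), try average(4000, 4095));
    try std.testing.expectError(error.OutOfRange, average(4096, 0));
}
```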
And the lower overall Kolmogorov complexity (cherry-picked) form:
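Again as an illustration rather than the commenter's own snippet, a minimal sketch of a `u12`-based form, assuming Zig 0.11 or later. `average` is a hypothetical function. Validation happens once, at the boundary where a raw `u16` arrives, here via `std.math.cast`.

```zig
const std = @import("std");

/// The parameter type already guarantees 0..4095, so no range checks
/// or error paths are needed inside the function.
fn average(a: u12, b: u12) u12 {
    const sum = @as(u13, a) + b; // widen by one bit so the sum cannot overflow
    return @intCast(sum / 2); // the halved sum always fits back into a u12
}

test "u12 readings carry the invariant in the type" {
    try std.testing.expectEqual(@as(u12, 4047), average(4000, 4095));
    try std.testing.expectEqual(@as(u12, 4095), average(4095, 4095));

    // Converting from a wider raw value is the one place a check remains.
    const too_big: u16 = 4096;
    const in_range: u16 = 4095;
    try std.testing.expect(std.math.cast(u12, too_big) == null);
    try std.testing.expect(std.math.cast(u12, in_range).? == 4095);
}
```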
C23 does have _BitInt types for structs, which can help if bit packing is your primary need, but IMHO they don't offer the same advantages.
As an example, and I may be wrong, but I think you can't easily perform checked arithmetic or use standard overflow operations on individual C bit-fields without copying them out into standard types (like int), modifying them, masking them, and copying them back.
With Zig, the invariant is maintained implicitly at the type layer, removing runtime validation branches, error paths, and testing code.
Does it solve all problems? No. Is @bitCast, a zero-runtime-overhead, compile-time-checked bit reinterpretation such as [3]u8 to u24, useless and silly? No.
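A hedged sketch of the `[3]u8` to `u24` reinterpretation, assuming Zig 0.11 or later. The result of `@bitCast` depends on the host byte order, so the example normalises it with `std.mem.littleToNative` before comparing. The byte values are arbitrary.

```zig
const std = @import("std");

test "@bitCast between [3]u8 and u24" {
    // Three bytes holding 0x123456 in little-endian order.
    const bytes = [3]u8{ 0x56, 0x34, 0x12 };

    // Reinterpret the 24 bits as an integer. The compiler rejects the cast
    // if the two types do not have the same number of bits.
    const native: u24 = @bitCast(bytes);

    // Interpret the bytes as little-endian regardless of the host byte order.
    const value = std.mem.littleToNative(u24, native);
    try std.testing.expectEqual(@as(u24, 0x123456), value);

    // The cast is reversible: the same bits come back as the same bytes.
    const back: [3]u8 = @bitCast(native);
    try std.testing.expectEqualSlices(u8, &bytes, &back);
}
```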
Yes, there are certainly use cases where you know the data you're parsing will only come from a narrow range of Unicode, such as U+0000 to U+007F — or from just the letters GCAT, as you mentioned. The overhead of converting 8-bit input to 7-bit might not be worth the cost, but the benefit of storing your input in just 2 bits per "letter" is definitely worth it.
I mostly wanted to make sure people know that the upper multilingual planes are a very real use case, and you need to test them. This is more important for languages such as C# where UTF-16 is the norm: many programmers don't know that they're handling surrogate pairs wrong until someone tries to backspace over an emoji character and it turns into something weird. It's probably less relevant to Zig, which didn't make the mistake that C# and Java did by starting out with UCS-2 (to be fair to them, they were designed in the era where people still thought that 65,536 codepoints would be enough for every language and Unicode would never need more than 16 bits). But the upper planes are important, and need to be tested no matter what language your code is written in.
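As one possible shape for such a test, here is a minimal sketch, assuming Zig 0.11 or later. It checks that U+1E900 (ADLAM CAPITAL LETTER ALIF, the first code point of the Adlam block) takes four bytes in UTF-8 and a surrogate pair in UTF-16. The surrogate arithmetic is written out by hand rather than relying on a library helper.

```zig
const std = @import("std");

test "U+1E900 needs four UTF-8 bytes and a UTF-16 surrogate pair" {
    const cp: u21 = 0x1E900; // first code point of the Adlam block

    // UTF-8: code points above U+FFFF encode to four bytes.
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(cp, &buf);
    try std.testing.expectEqual(@as(u3, 4), len);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xF0, 0x9E, 0xA4, 0x80 }, buf[0..len]);

    // UTF-16: subtract 0x10000, then split the remaining 20 bits in half.
    const v: u21 = cp - 0x10000;
    const high: u16 = 0xD800 + @as(u16, @intCast(v >> 10));
    const low: u16 = 0xDC00 + @as(u16, @intCast(v & 0x3FF));
    try std.testing.expectEqual(@as(u16, 0xD83A), high);
    try std.testing.expectEqual(@as(u16, 0xDD00), low);
}
```

Code that treats each 16-bit unit as one character would see two "characters" here, which is the backspace bug described above.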