Comment by Dylan16807

1 year ago

I was going to object to using something new at all, but their recommendation for up to 31 bits is the same as the original UTF-8. They only add new logic for sequences starting with FF.

I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.
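For reference, the 31-bit and 36-bit figures fall out of the original UTF-8 bit layout. Here is a minimal Python sketch of that arithmetic; `payload_bits` is a hypothetical helper name, not something from the linked site.

```python
def payload_bits(continuation_bytes: int) -> int:
    """Payload bits of an original-style UTF-8 sequence with 1..6 continuation bytes."""
    if not 1 <= continuation_bytes <= 6:
        raise ValueError("original UTF-8 lead bytes cover 1 to 6 continuation bytes")
    # The lead byte spends (continuation_bytes + 1) one-bits plus a zero bit
    # on the length marker; whatever is left of its 8 bits is payload.
    lead_payload = 8 - (continuation_bytes + 2)
    # Each continuation byte is 10xxxxxx, so it carries 6 payload bits.
    return lead_payload + 6 * continuation_bytes


for n in range(1, 7):
    print(n, payload_bits(n))
```

Five continuation bytes (lead bytes FC and FD) give 31 bits, and six (lead byte FE, which has no payload bits of its own) give 36.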

The proposed extension mechanism itself is quite extensible in my understanding, so you should be able to define UCS-T and UCS-P (for tera and peta respectively) with minimal changes. The website offers an FAQ for this very topic [1], too.

[1] https://ucsx.org/why#3.1

  • That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions, I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.

    • The length encoding is fun to think about, if you want it to go all the way up to infinity, and avoid wasting bytes.

      My thought: Lead bytes C2–FE are followed by 1 to 6 continuation bytes, as usual. "FF 80+x", for x ≤ 0x3E, begins an (x+7)-byte sequence. "FF BF 80+x", again for x ≤ 0x3E, begins an (x+2)-byte length for the following sequence, offset as necessary to avoid overlong length encodings. (Length bits are expressed in the same 6-bit "80+x" encoding as the codepoint itself.) "FF BF BF 80+x" begins an (x+2)-byte length for the encoded length of the sequence. And so on: the number of initial BF bytes gives the number of length levels past the first. (I believe there's a name for this sort of representation, but I cannot find it.)

      Assuming offsets are used properly, decoders would have an easy time jumping off the wagon at whatever point the lengths would become too long for them to possibly work with. In particular, you can get a simple subset up to 222 codepoint bits by just using "FF 80" through "FF BE" as simple lengths, and leaving "FF BF" reserved.
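The simple subset described above can be sketched roughly as follows in Python. This is one reading of the scheme, not a reference implementation: it assumes "(x+7)-byte sequence" means x+7 continuation bytes after the "FF 80+x" prefix, it treats "FF BF" as reserved, and the names `encode` and `decode` are made up for illustration.

```python
def encode(cp: int) -> bytes:
    """Encode a non-negative integer in the simple subset (no "FF BF" forms)."""
    if cp < 0:
        raise ValueError("value must be non-negative")
    if cp < 0x80:
        return bytes([cp])
    bits = cp.bit_length()
    # Classic forms: n continuation bytes carry 6*n bits, the lead byte 6-n more.
    for n in range(1, 7):
        if bits <= 6 * n + (6 - n):
            lead_mask = (0xFF << (7 - n)) & 0xFF  # n+1 one-bits, then a zero
            head = [lead_mask | (cp >> (6 * n))]
            break
    else:
        # Assumed extended form: FF, then 80+x, then x+7 continuation bytes.
        n = -(-bits // 6)  # ceiling division; at least 7 here since bits > 36
        x = n - 7
        if x > 0x3E:
            raise ValueError("value would need the reserved FF BF forms")
        head = [0xFF, 0x80 + x]
    tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
    return bytes(head + tail)


def decode(data: bytes) -> int:
    """Decode exactly one sequence; reject malformed, reserved, or overlong input."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        n, value, start = 0, lead, 1
    elif lead < 0xC0:
        raise ValueError("stray continuation byte")
    elif lead < 0xFF:
        zero_pos = (lead ^ 0xFF).bit_length()  # 1-based position of the first zero bit
        n = 7 - zero_pos
        value = lead & ((1 << (zero_pos - 1)) - 1)
        start = 1
    else:
        if len(data) < 2 or not 0x80 <= data[1] <= 0xBE:
            raise ValueError("missing length byte, or reserved FF BF form")
        n, value, start = data[1] - 0x80 + 7, 0, 2
    tail = data[start:]
    if len(tail) != n or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("wrong number of continuation bytes")
    for b in tail:
        value = (value << 6) | (b & 0x3F)
    # Only the shortest form is valid, so re-encoding must give the input back.
    if encode(value) != bytes(data):
        raise ValueError("overlong encoding")
    return value


print(encode(0x20AC).hex())   # matches standard UTF-8 for U+20AC
print(encode(1 << 36).hex())  # first value that needs the FF 80 form
```

A decoder like this "jumps off the wagon" at "FF BF" simply by raising an error there. The multi-level length forms are left out because the exact offsets are not pinned down in the comment.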