Comment by lifthrasiir
1 year ago
The proposed extension mechanism itself is quite extensible in my understanding, so you should be able to define UCS-T and UCS-P (for tera and peta respectively) with minimal changes. The website offers an FAQ for this very topic [1], too.
That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions; I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.
The length encoding is fun to think about, if you want it to go all the way up to infinity, and avoid wasting bytes.
My thought: lead bytes C2–FE are followed by 1 to 6 continuation bytes as usual. "FF 80+x", for x ≤ 0x3E, begins an (x+7)-byte sequence. "FF BF 80+x", again for x ≤ 0x3E, begins an (x+2)-byte length for the following sequence, offset as necessary to avoid overlong length encodings. (Length bits are expressed in the same 6-bit "80+x" encoding as the codepoint itself.) "FF BF BF 80+x" begins an (x+2)-byte length for the encoded length of the sequence, and so on: the number of initial BF bytes is the number of length levels past the first. (I believe there's a name for this sort of representation, but I cannot find it.)
Assuming offsets are used properly, decoders would have an easy time jumping off the wagon at whatever point the lengths would become too long for them to possibly work with. In particular, you can get a simple subset up to 222 codepoint bits by just using "FF 80" through "FF BE" as simple lengths, and leaving "FF BF" reserved.
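To make the simple subset concrete, here is a rough Python sketch of how a decoder might read just the header. It is only my reading of the scheme above, not a reference implementation: the function name `continuation_count` is made up, and I am assuming that "(x+7)-byte sequence" means x+7 continuation bytes after the "FF 80+x" header. The multi-level "FF BF ..." forms are treated as reserved and rejected.

```python
from typing import Tuple


def continuation_count(data: bytes) -> Tuple[int, int]:
    """Return (header_length, continuation_bytes) for the sequence at data[0].

    Hypothetical helper covering only the simple subset: ASCII,
    lead bytes C2-FE, and "FF 80" through "FF BE".
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return 1, 0  # plain ASCII, no continuation bytes
    if lead < 0xC2:
        # 80-BF are continuation bytes; C0 and C1 would only encode overlongs
        raise ValueError(f"invalid lead byte {lead:#04x}")
    if lead < 0xFF:
        # Classic prefix-bit scheme: n leading 1 bits means n-1 continuation
        # bytes. C2-DF gives 1, and FE (seven 1 bits) gives 6.
        ones = 0
        mask = 0x80
        while lead & mask:
            ones += 1
            mask >>= 1
        return 1, ones - 1
    # lead == 0xFF: the next byte stores the length explicitly as 80+x
    if len(data) < 2:
        raise ValueError("truncated header after FF")
    second = data[1]
    if not 0x80 <= second <= 0xBF:
        raise ValueError(f"invalid length byte {second:#04x}")
    x = second - 0x80
    if x == 0x3F:
        # "FF BF" starts a multi-level length; left reserved in this subset
        raise NotImplementedError("FF BF is reserved in the simple subset")
    # Assumption: x + 7 counts the continuation bytes that follow the header
    return 2, x + 7
```

For example, `continuation_count(b"\xFF\x80")` gives a 2-byte header followed by 7 continuation bytes, picking up right where FE (6 continuation bytes) leaves off, and a decoder can bail out on "FF BF" before looking at anything further.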