
Comment by LegionMammal978

1 year ago

The length encoding is fun to think about if you want it to go all the way up to infinity without wasting bytes.

My thought: Lead bytes C2–FE are followed by 1 to 6 continuation bytes as usual. "FF 80+x", for x ≤ 0x3E, begins an (x+7)-byte sequence. "FF BF 80+x", again for x ≤ 0x3E, begins an (x+2)-byte length for the sequence that follows, offset as necessary to avoid overlong length encodings. (Length bits are expressed in the same 6-bit "80+x" encoding as the codepoint itself.) "FF BF BF 80+x" begins an (x+2)-byte length that in turn gives the encoded length of the sequence's length. And so on, where the number of BF bytes directly after the FF denotes the number of length levels past the first. (I believe there's a name for this sort of representation, but I cannot find it.)
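To make the "count the BF bytes" rule concrete, here is a rough Python sketch of reading just the prefix of an FF-led sequence. The function name `length_prefix` and its return shape are made up for illustration, and it deliberately stops before interpreting the length bytes, since the exact offsets are left open above.

```python
def length_prefix(data: bytes) -> tuple[int, int]:
    """Return (levels, x) for a sequence that starts with FF.

    levels: number of BF bytes directly after FF (length levels past the first).
    x:      6-bit value of the byte that ends the prefix (80+x, x <= 0x3E).
    Hypothetical helper; it does not decode the lengths themselves.
    """
    if not data or data[0] != 0xFF:
        raise ValueError("not an FF-led sequence")
    i = 1
    # Each BF directly after FF adds one more level of length nesting.
    while i < len(data) and data[i] == 0xBF:
        i += 1
    if i >= len(data):
        raise ValueError("truncated prefix")
    b = data[i]
    if not 0x80 <= b <= 0xBE:
        raise ValueError("prefix must end with a byte in 80..BE")
    return i - 1, b - 0x80
```

So "FF 85" comes back as level 0 with x = 5, and "FF BF BF 83" as level 2 with x = 3. A decoder could bail out as soon as `levels` exceeds whatever it is willing to handle.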

Assuming offsets are used properly, decoders would have an easy time jumping off the wagon at whatever point the lengths would become too long for them to possibly work with. In particular, you can get a simple subset up to 222 codepoint bits by just using "FF 80" through "FF BE" as simple lengths, and leaving "FF BF" reserved.
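A minimal sketch of that simple subset in Python, for the curious. Assumptions on my part: `continuation_count` is a hypothetical name, the C2–FE boundaries follow the usual leading-ones pattern of pre-2003 UTF-8, and "(x+7)-byte sequence" is read as x+7 continuation bytes after the two-byte "FF 80+x" header. It only finds the sequence length; it does not decode the codepoint or check for overlong forms.

```python
def continuation_count(data: bytes) -> tuple[int, int]:
    """Return (header_len, n_continuation) for the sequence starting at data[0].

    Simple subset only: "FF BF" is treated as reserved and rejected.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return 1, 0                      # ASCII, no continuation bytes
    if lead < 0xC2:
        raise ValueError("not a valid lead byte")
    # (upper bound of lead-byte range, continuation bytes), assumed boundaries
    for upper, count in ((0xDF, 1), (0xEF, 2), (0xF7, 3),
                         (0xFB, 4), (0xFD, 5), (0xFE, 6)):
        if lead <= upper:
            return 1, count
    # lead == 0xFF: the next byte carries the length as 80+x
    if len(data) < 2:
        raise ValueError("truncated after FF")
    second = data[1]
    if second == 0xBF:
        raise ValueError("FF BF is reserved in the simple subset")
    if not 0x80 <= second <= 0xBE:
        raise ValueError("expected a byte in 80..BE after FF")
    return 2, (second - 0x80) + 7
```

For example, `continuation_count(bytes([0xFF, 0x80]))` gives a 2-byte header followed by 7 continuation bytes, which picks up right where FE (6 continuation bytes) leaves off.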