Comment by lifthrasiir

1 year ago

If you do need the expansion of code point space, https://ucsx.org/ is the definitive answer; it was designed by actual Unicode contributors.

I was going to object to using something new at all, but their recommendation for up to 31 bits is the same as the original UTF-8. They only add new logic for sequences starting with FF.
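For reference, here is a minimal sketch of that original UTF-8 layout (the 1 to 6 byte form from RFC 2279, covering up to 31 bits). The function name `encode_utf8_31` is my own, and this deliberately stops short of the FE/FF lead bytes, so it says nothing about how the ucsx.org extensions actually work.

```python
def encode_utf8_31(cp: int) -> bytes:
    """Encode a value below 2**31 using the original 1-6 byte UTF-8 layout.

    Hypothetical helper for illustration; it does not reject surrogates
    and does not implement any extension beyond 31 bits.
    """
    if not 0 <= cp < 1 << 31:
        raise ValueError("value outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # single byte, 7 payload bits
    # (payload bits available, lead-byte marker, number of continuation bytes)
    layouts = (
        (11, 0xC0, 1),
        (16, 0xE0, 2),
        (21, 0xF0, 3),
        (26, 0xF8, 4),
        (31, 0xFC, 5),
    )
    for bits, lead, n_cont in layouts:
        if cp < 1 << bits:
            # Lead byte carries the top bits; each continuation byte carries 6.
            out = [lead | (cp >> (6 * n_cont))]
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable")
```

For anything in today's Unicode range (surrogates aside) this agrees with a normal UTF-8 encoder; the 5 and 6 byte forms only appear above 0x1FFFFF.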

I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.

  • As I understand it, the proposed extension mechanism is itself quite extensible, so you should be able to define UCS-T and UCS-P (for tera and peta, respectively) with minimal changes. The website also has an FAQ entry on this very topic [1].

    [1] https://ucsx.org/why#3.1

    • That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions, I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.
