Comment by lifthrasiir
1 year ago
If you do need the expansion of code point space, https://ucsx.org/ is the definitive answer; it was designed by actual Unicode contributors.
I was going to object to using something new at all, but their recommendation for up to 31 bits is the same as the original UTF-8. They only add new logic for sequences starting with FF.
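For reference, here is a minimal sketch of that original 1-to-6-byte UTF-8 layout (the pre-2003 form that reached 31 bits), not UCSX's own code. The function name `encode_utf8_original` is made up for illustration, and the sketch does not attempt the FF-prefixed extensions at all.

```python
def encode_utf8_original(cp: int) -> bytes:
    """Encode cp with the original 1-to-6-byte UTF-8 layout (up to 31 bits).

    Hypothetical helper for illustration; it does not cover any FF-prefixed form.
    """
    if cp < 0 or cp >= 1 << 31:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # single byte: 0xxxxxxx
    # (total length, lead-byte marker) for each multi-byte form.
    # An n-byte form carries (7 - n) bits in the lead byte
    # plus 6 bits in each of its n - 1 continuation bytes.
    for length, marker in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
        payload_bits = (7 - length) + 6 * (length - 1)
        if cp < 1 << payload_bits:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])
```

For code points that modern UTF-8 still allows, this should agree with Python's built-in encoder, e.g. `encode_utf8_original(0x20AC) == "€".encode("utf-8")`; the 5- and 6-byte forms are what carry it from 21 bits up to 31.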
I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.
The proposed extension mechanism itself is quite extensible, as I understand it, so you should be able to define UCS-T and UCS-P (for tera and peta, respectively) with minimal changes. The website also has an FAQ entry on this very topic [1].
[1] https://ucsx.org/why#3.1
That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions, I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.
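To make that concrete, here is a rough sketch of one way an explicit-length form could look. Everything in it is my own assumption and not anything UCSX specifies: the name `encode_explicit_length`, the choice of a single length byte after FF, and the 6-bits-per-byte big-endian payload are all hypothetical. It only handles values above the 36 bits that the shorter forms already reach.

```python
def encode_explicit_length(value: int) -> bytes:
    """Hypothetical FF-prefixed form: FF, one length byte, then the payload.

    Assumed layout (not from UCSX): the length byte is 0x80 | n, where n is
    the number of payload bytes, and each payload byte is 10xxxxxx (6 bits),
    most significant group first. Always uses the smallest n that fits.
    """
    if value < 1 << 36:
        raise ValueError("values up to 36 bits belong to the shorter forms")
    n = (value.bit_length() + 5) // 6  # payload bytes needed, 6 bits each
    if n > 63:
        raise ValueError("value too large for a one-byte length field")
    payload = bytes(
        0x80 | ((value >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)
    )
    return bytes([0xFF, 0x80 | n]) + payload
```

Because the length byte comes right after FF and only the shortest length is allowed, a larger value never gets a smaller length byte, so plain bytewise comparison of two encodings matches numeric order without any special-casing. The cost is one extra byte compared with packing length bits into the lead byte.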