Comment by kmeisthax

1 year ago

> The original design of UTF-8 (as "FSS-UTF," by Pike and Thompson; standardized in 1996 by RFC 2044) could encode codepoints up to U+7FFF FFFF. In 2003 the IETF changed the specification (via RFC 3629) to disallow encoding any codepoint beyond U+10 FFFF. This was purely because of internal ISO and Unicode Consortium politics; they rejected the possibility of a future in which codepoints would exist that UTF-16 could not represent. UTF-16 is now obsolete, so there is no longer any reason to stick to this upper limit, and at the present rate of codepoint allocation, the space below U+10 FFFF will be exhausted in something like 600 years (less if private-use space is not reclaimed). Text encodings are forever; the time to avoid running out of space is now, not 550 years from now.

UTF-16 is integral to the workings of Windows, Java, and JavaScript, so it's not going away anytime soon. To make things worse, those systems don't even support surrogates correctly, to the point where we had to build WTF-8, a system for handling malformed UTF-8 converted from these UTF-16 early adopters. Before we can start talking about characters beyond plane 16, we need to find an answer for how those existing systems should handle characters beyond U+10FFFF.

I can't think of a good way for them to do this, though:

1. Opting in to an alternate UTF-8 string type to migrate these systems off UTF-16 means loads of old software that just chokes on new characters. Do you remember how MySQL decided you had to opt into utf8mb4 encoding to use astral characters in strings? And how basically nobody bothered to do this up until emoji forced everyone's hand? Do you want to do that dance again, but for the entire Windows API?

2. We can't just "rip out UTF-16" without breaking compatibility. WCHAR strings in Windows are expected to be made of 16-bit units that each hold a Unicode codepoint, and programs can index those directly. JavaScript strings are a bit better in that they could be UTF-8 internally, but they still have length and indexing semantics inherited from Unicode 1.0.

3. If we don't "rip out UTF-16" though, then we need some kind of representation of characters beyond plane 16. There is no space left in plane 0 (the BMP) for this; we already used a good chunk of it for surrogates. Furthermore, it's a practical requirement of Unicode that all encodings be self-synchronizing. Deleting or inserting a byte shouldn't change the meaning of more than one or two characters.

The most practical way forward for >U+10FFFF "superastrals" would be to reserve space for super-surrogates in the currently unused plane 4-13 space. A plane for low surrogates and half a plane for high would give us 31 bits of coding, but they'd already be astral characters. This yields the rather comical result of requiring 8 bytes to represent a 4 byte codepoint, because of two layers of surrogacy.
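To make the arithmetic concrete, here is a rough sketch of that double-surrogate layout. The plane assignments (`HIGH_BASE` in the first half of plane 4, `LOW_BASE` in plane 5) and the choice to count the 31-bit index from U+110000 are my own assumptions for illustration; nothing above pins them down.

```python
# Hypothetical placement of the super-surrogate blocks; not part of any standard.
HIGH_BASE = 0x40000  # assumed: first half of plane 4 (2**15 code points)
LOW_BASE = 0x50000   # assumed: all of plane 5 (2**16 code points)
FIRST_SUPERASTRAL = 0x110000

def super_surrogates(cp):
    """Split a superastral code point into a (high, low) super-surrogate pair."""
    index = cp - FIRST_SUPERASTRAL          # assumed offset, like UTF-16's 0x10000
    if not 0 <= index < (1 << 31):          # 15 + 16 = 31 bits of coding
        raise ValueError("outside the 31-bit super-surrogate range")
    return HIGH_BASE + (index >> 16), LOW_BASE + (index & 0xFFFF)

def to_utf16le(cp):
    """Encode via ordinary UTF-16; each super-surrogate is itself astral."""
    high, low = super_surrogates(cp)
    return (chr(high) + chr(low)).encode("utf-16-le")

# Two astral characters, two 16-bit units each: 8 bytes per superastral.
print(len(to_utf16le(0x110000)))
```

Each super-surrogate costs an ordinary surrogate pair, which is where the 8 bytes come from.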

If we hadn't already dedicated codepoints to the first layer of surrogates, we could have had an alternative with unlimited coding range like UTF-8. If I were allowed to redefine 0xD800-0xDFFF, I'd change them from low and high surrogates to initial and extension surrogates, as such:

- 2-word initial surrogate: 0b1101110 + 9 bits of initial codepoint index (U+10000 through U+7FFFF)

- 3-word initial surrogate: 0b11011110 + 8 bits of initial codepoint index (U+80000 through U+FFFFFFF)

- 4-word initial surrogate: 0b110111110 + 7 bits of initial codepoint index (U+10000000 through U+1FFFFFFFFF)

- Extension surrogate: 0b110110 + 10 bits of additional codepoint index

U+80000 to U+10FFFF now take 6 bytes to encode instead of 4, but in exchange we now can encode U+110000 through U+FFFFFFF in the same size. We can even trudge on to 37-bit codepoints, if we decided to invent a surrogacy scheme for UTF-32[0] and also allow FE/FF to signal very long UTF-8 sequences as suggested in the original article. Suffice it to say this is a comically overbuilt system.
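Here is a minimal sketch of the initial/extension scheme as described in the list above. I'm assuming the "codepoint index" is the raw codepoint value with no offset subtracted (which is what the quoted ranges suggest), and the names `encode` and `decode` are just for illustration. It does not reject overlong forms.

```python
EXT_BASE = 0xD800  # 0b110110 + 10 payload bits (0xD800-0xDBFF)

# (prefix value, payload bits in the initial word, total 16-bit words)
INITIAL_FORMS = [
    (0xDC00, 9, 2),  # 0b1101110   + 9 bits
    (0xDE00, 8, 3),  # 0b11011110  + 8 bits
    (0xDF00, 7, 4),  # 0b110111110 + 7 bits
]

def encode(cp):
    """Encode one code point as a list of 16-bit words."""
    if cp < 0x10000:
        if cp < 0 or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not an encodable code point")
        return [cp]
    for base, init_bits, words in INITIAL_FORMS:
        ext_words = words - 1
        if cp < (1 << (init_bits + 10 * ext_words)):
            units = [base | (cp >> (10 * ext_words))]
            for i in range(ext_words - 1, -1, -1):
                units.append(EXT_BASE | ((cp >> (10 * i)) & 0x3FF))
            return units
    raise ValueError("code point needs more than 37 bits")

def decode(units):
    """Decode a list of 16-bit words back into code points."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if not 0xD800 <= u <= 0xDFFF:
            out.append(u)
            i += 1
            continue
        for base, init_bits, words in INITIAL_FORMS:
            if base <= u < base + (1 << init_bits):
                break
        else:
            raise ValueError("extension word without an initial word")
        tail = units[i + 1:i + words]
        if len(tail) != words - 1 or any(not 0xD800 <= e <= 0xDBFF for e in tail):
            raise ValueError("truncated or malformed sequence")
        cp = u - base
        for e in tail:
            cp = (cp << 10) | (e - EXT_BASE)
        out.append(cp)
        i += words
    return out

print([hex(w) for w in encode(0x10FFFF)])  # three words, i.e. 6 bytes
```

Under these assumptions U+10FFFF comes out as three words, and the 4-word form tops out at the 37-bit limit mentioned above.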

Of course, the feasibility of this is also debatable. I just spent a good while explaining why we can't touch UTF-16 at all, right? Well, most of the stuff that is married to UTF-16 specifically ignores surrogates, treating them as a headache for the application developer. In practice, mispaired surrogates never break things; that's why we had to invent WTF-8 to clean up after that mess.

You may have noticed that initial surrogates in my scheme occupy the coding space for low surrogates. Existing surrogates are supposed to be sent in the order high, low. So an initial, extension pair is actually the opposite surrogate order from what existing code expects. Unfortunately this isn't quite self-synchronizing in the world we currently live in. Deleting an initial surrogate will change the meaning of all following 2-word pairs to high/low pairs, unless you have some out of band way to signal that some text is encoded with initial / extension surrogates instead of high / low pairs. So I wouldn't recommend sending anything like this on the wire, and UTF-16 parsers would need to forbid mixed surrogacy ordering.
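To illustrate the failure mode, here is a toy decoder that accepts both orderings in one stream, which is exactly the thing a real parser would need to forbid. `mixed_decode` is a hypothetical helper that only handles 2-word forms, and it assumes the initial word carries the upper bits of the raw codepoint value.

```python
def mixed_decode(units):
    """Toy decoder accepting both initial/extension and high/low pairs."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xDC00 <= u <= 0xDDFF and nxt is not None and 0xD800 <= nxt <= 0xDBFF:
            # Proposed order: 2-word initial surrogate, then extension.
            out.append(((u - 0xDC00) << 10) | (nxt - 0xD800))
            i += 2
        elif 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Existing UTF-16 order: high surrogate, then low surrogate.
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            out.append(u)  # lone surrogate or BMP code unit, passed through
            i += 1
    return out

# U+1F600 three times in the proposed scheme: initial 0xDC7D, extension 0xDA00.
intact = [0xDC7D, 0xDA00] * 3
damaged = intact[1:]  # the first initial surrogate got deleted

print([hex(c) for c in mixed_decode(intact)])
print([hex(c) for c in mixed_decode(damaged)])
```

With the first word deleted, every remaining word gets re-paired as high/low, so none of the following characters survive. The damage is not confined to the character that lost a word.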

But then again, nobody sends UTF-16 on the wire anymore, so I don't know how much of a problem this would be. And of course, there's the underlying problem that the demand for codepoints beyond U+10FFFF is very low. Hell, the article itself admits the current Unicode growth rate has 600 years before it runs into this problem.

[0] Un(?)fortunately this would not be able to reuse the existing surrogate space for UTF-16, meaning we'd need to have a huge amount of the superastral planes reserved for even more comically large expansion.

A while back I came up with the idea to carve out 4096 code points in plane 14 (Supplementary Special-purpose Plane) for super-surrogates, and use three such surrogates (1 initial, 2 extension) for codepoints beginning from U+110000. If done properly you get unlimited range and self-synchronizing, at the expense of needing 12 bytes minimum per codepoint (more if you want it truly unlimited), but I figured the demand for UTF-16 would be low enough by the time it's needed that it's a workable tradeoff.