Comment by josephg

1 year ago

I'm surprised this doesn't mandate one of the Unicode Normalization Forms. Normalization is both obscure and complex. Unicode should have a single canonical binary encoding for all character sequences.

It's a missed opportunity that this isn't already the case - but if you're going to replace UTF-8, we should absolutely mandate one of the normalization forms along the way.

https://unicode.org/reports/tr15/

Normalization is annoying but understandable - you have common characters that are clearly SOMETHING + MODIFIER, and they are common enough that you want to represent them as a single character to avoid byte explosion. SOMETHING and MODIFIER are also both useful on their own, potentially combining with other, less common characters that aren't worth encoding as precomposed characters (infrequent, but still valuable).

If you skip all the modifiers, you end up with an explosion in code space. If you skip all the precomposed characters, you end up with an explosion in bytes.
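To make the tradeoff concrete, here is a minimal Python sketch using the standard `unicodedata` module. The choice of "é" as the example character and the helper name `utf8_len` are mine, picked for illustration only:

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9)
composed = "\u00e9"
# "e" followed by COMBINING ACUTE ACCENT (U+0301)
decomposed = "e\u0301"

def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

# Both render the same, but they are different code point sequences
# and take a different number of bytes.
print(composed == decomposed)   # False
print(utf8_len(composed))       # 2
print(utf8_len(decomposed))     # 3

# Normalization maps between the two spellings.
nfc = unicodedata.normalize("NFC", decomposed)  # composes where possible
nfd = unicodedata.normalize("NFD", composed)    # decomposes fully
print(nfc == composed)          # True
print(nfd == decomposed)        # True
```

The precomposed form saves a byte here, while the combining accent can still attach to base characters that have no precomposed version.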

There's no good solution here, so normalization makes sense. But then the committee says "... and what about this kind of normalization?" and then you end up... here.

  • Right. But if we had a chance for a do-over, it'd be really nice if we all just agreed on a normalization form and used it from the start in all our software. Seems like a missed opportunity not to.

    • I think NFC is the agreed-upon normalization form, is it not? The only real exception I can think of is HFS+ but that was corrected in APFS (which uses NFC now like the rest of the world).

I don't think you can mandate that in this kind of encoding. This just encodes code points, with some choices made so that certain invalid code points cannot be encoded.

But normalized forms are about sequences of code points that are semantically equivalent. You can't make the non-normalized code point sequences unencodable in an encoding that only looks at one code point at a time. You wouldn't want to anchor the encoding to any particular version of Unicode either.

Normalized forms have to happen at another layer. That layer is often omitted for efficiency or because nobody stopped to consider it, but the code point encoding layer isn't the right place.
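A rough sketch of that layering, in Python. The function names `encode_code_points` and `canonical_equal` are hypothetical, and UTF-8 stands in for whatever code point encoding is in use; this is an illustration under those assumptions, not how any particular library structures it:

```python
import unicodedata

def encode_code_points(text: str) -> bytes:
    """Encoding layer: handles one code point at a time, with no
    knowledge of the code points before or after it."""
    return b"".join(ch.encode("utf-8") for ch in text)

def canonical_equal(a: str, b: str) -> bool:
    """Separate, higher layer: compares whole sequences after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

a = "caf\u00e9"    # precomposed é
b = "cafe\u0301"   # e + combining acute accent

bytes_a = encode_code_points(a)
bytes_b = encode_code_points(b)

# The encoder accepts both spellings and produces different bytes;
# it has no way to reject the non-normalized one.
print(bytes_a != bytes_b)                # True
print(bytes_b.decode("utf-8") == b)      # True

# Equivalence only shows up once the normalization layer runs.
print(canonical_equal(a, b))             # True
```

Note that the result of `canonical_equal` depends on the Unicode data shipped with the Python build, which is the version-anchoring problem mentioned above.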

This proposal seems like trying to reverse engineer a normalization form into an encoding form. At face value, an encoding form that doesn't even technically support denormalized forms sounds like a good thing. But once you read the details of all the normalization forms and get into the weeds and edge cases (why normal forms are locale specific, and why they are so complex even beyond that), you start to question whether a "single canonical binary encoding for character sequences" is possible at all. I think you then start to appreciate why the normal forms are algorithms at a higher level, above the raw binary encoding, rather than something built into the binary encoding form.