Comment by duskwuff
1 year ago
1) Adding offsets to multi-byte sequences breaks compatibility with existing UTF-8 text, while generating text which can be decoded (incorrectly) as UTF-8. That seems like a non-starter. The alleged benefit of "eliminating overlength encodings" seems marginal; overlength encodings are already invalid. It also significantly increases the complexity of encoders and decoders, especially in dealing with discontinuities like the UTF-16 surrogate "hole".
2) I really doubt that the current upper limit of U+10_FFFF is going to need to be raised. Past growth in the Unicode standard has primarily been driven by the addition of more CJK characters; that isn't going to continue indefinitely.
3) Disallowing C0 characters like U+0009 (horizontal tab) is absurd, especially at the level of a text encoding.
4) BOMs are dumb. We learned that lesson in the early 2000s - even if they sound great as a way of identifying text encodings, they have a nasty way of sneaking into the middle of strings and causing havoc. Bringing them back is a terrible idea.
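To make points 1 and 4 concrete, here is a rough sketch using Python's built-in codecs. The helper names `decodes_as_utf8` and `join_bom_prefixed` are made up for illustration, and it assumes CPython's strict "utf-8" and "utf-8-sig" codecs as the decoder. It shows that an overlong form is already rejected, and that a BOM stuck in the middle of concatenated text survives decoding.

```python
import codecs


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def join_bom_prefixed(parts: list[str]) -> str:
    """Naively concatenate BOM-prefixed UTF-8 chunks, then decode.

    Hypothetical helper: mimics tools that prepend a BOM to every chunk.
    "utf-8-sig" strips only a leading BOM, so later ones stay in the text.
    """
    raw = b"".join(codecs.BOM_UTF8 + p.encode("utf-8") for p in parts)
    return raw.decode("utf-8-sig")


# Point 1: 0xC0 0xAF is the 2-byte overlong form of U+002F '/'.
# A strict decoder refuses it, so there is nothing left to "eliminate".
print(decodes_as_utf8(b"\xc0\xaf"))  # overlong form
print(decodes_as_utf8(b"/"))         # shortest form

# Point 4: the second BOM ends up as U+FEFF between "foo" and "bar".
joined = join_bom_prefixed(["foo", "bar"])
print(joined == "foobar")
print("\ufeff" in joined)
```

The stray U+FEFF is invisible when printed, which is exactly why it tends to go unnoticed until a string comparison or a parser trips over it.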
Yes, it should be completely incompatible with UTF-8, not just partially. As in, anything beyond ASCII should be invalid and not decodable as UTF-8.