
Comment by toast0

6 hours ago

The sibling comments so far talk about the self-synchronizing nature of the lead/continuation indicators, but that's not relevant to your question. Your question is more along the lines of:

Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?

I suspect the answer is

a) the security impacts of overlong encodings were not contemplated; lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings

b) utf-8 as standardized allows encoding and decoding with bitmask and bitshift only. Your proposed encoding requires bitmask and bitshift plus addition and subtraction.
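
To illustrate b), here's a minimal sketch in Python (the function names and the offset scheme are my own reading of "the lowest sequence after 7f", not anything from the thread): standard UTF-8 covers the 2-byte range with masks and shifts alone, while a scheme where c0 80 is the first 2-byte sequence also has to add or subtract the 0x80 offset.

    # Standard UTF-8, 2-byte form: 110xxxxx 10xxxxxx for U+0080..U+07FF.
    def encode2_utf8(cp):
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

    # Hypothetical offset scheme where c0 80 is the first 2-byte sequence:
    # the payload is (cp - 0x80) rather than cp itself, so encoding needs a
    # subtraction and decoding needs the matching addition.
    def encode2_offset(cp):
        v = cp - 0x80
        return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])

    print(encode2_utf8(0x80).hex())    # c280 -- what UTF-8 actually produces
    print(encode2_offset(0x80).hex())  # c080 -- the question's scheme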

You can find a bit of email discussion from 1992 here [1] ... at the very bottom there are some notes about what became utf-8:

> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.

The FSS-UTF proposal included right before that note does use additive constants.

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

A variation of a) is comparing strings as UTF-8 byte sequences while overlong encodings are also accepted (beforehand and/or afterwards). This leads to situations where strings that test as unequal are actually equal in terms of code points.
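
A minimal sketch of that failure mode in Python (the lenient decoder and the example inputs are mine, purely for illustration): a byte-level comparison or scan disagrees with a decoder that tolerates overlong 2-byte forms.

    # A decoder that handles ASCII and 2-byte sequences but, crucially,
    # does NOT reject overlong forms (lead bytes c0/c1 are always overlong).
    def lenient_decode(data):
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                out.append(b)
                i += 1
            elif 0xC0 <= b <= 0xDF:
                out.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
                i += 2
            else:
                raise ValueError("longer sequences not handled in this sketch")
        return "".join(map(chr, out))

    shortest = b"H"          # 48
    overlong = b"\xC1\x88"   # also comes out as U+0048 from the lenient decoder

    print(shortest == overlong)                                  # False: bytes differ
    print(lenient_decode(shortest) == lenient_decode(overlong))  # True: same code points

    # The scanning case from a): a filter that looks for the shortest
    # encoding of '/' (2f) misses the overlong c0 af, but the decoder
    # happily turns it back into '/'.
    sneaky = b"..\xC0\xAF..\xC0\xAFetc"
    print(b"/" in sneaky)                  # False
    print("/" in lenient_decode(sneaky))   # True ("../../etc")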

  • Ehhh I view things slightly differently. Overlong encodings are per se illegal, so they cannot encode code points, even if a naive algorithm would consistently interpret them as such.

    I get what you mean, in terms of Postel's Law, e.g., software that is liberal in what it accepts should view 01001000 01100101 01101010 01101010 01101111 as equivalent to 11000001 10001000 11000001 10100101 11000001 10101010 11000001 10101010 11000001 10101111, despite the sequence not being byte-for-byte identical. I'm just not convinced Postel's Law should be applied wrt UTF-8 code units.

    • The context of my comment was (emphasis mine): “lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings”.

      Yes, software shouldn’t accept overlong encodings, and I was pointing out another bad thing that can happen with software that does accept overlong encodings, thereby reinforcing the advice to not accept them.

Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.

I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security wasn't as big of a thing then as it is now, so you're probably right.