Comment by hyperman1
9 hours ago
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest encoding is valid. E.g. 00000001 is the same as 11000000 10000001.
So why not make the alternatives impossible by adding the first value not covered by the shorter sequences? Then 11000000 10000001 would give codepoint 128+1, since values 0 to 127 are already covered by a 1-byte sequence.
The advantages are clear: no illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? Was the required addition an unacceptable hardware cost at the time?
UPDATE: The last bit sequence should of course be 10000001 and not 00000001. Sorry for that; fixed it.
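To make that concrete, here is a minimal decode sketch for the 2-byte case (function names are just illustrative, and the 0x80 offset is my reading of the proposal, i.e. the count of code points already covered by 1-byte sequences):

    #include <stdint.h>

    /* Standard UTF-8, 2-byte case: the payload bits are the code point
       directly, so c2 80 -> U+0080 and c0 81 is an (illegal) overlong
       form of U+0001. */
    static uint32_t decode2_standard(uint8_t b0, uint8_t b1) {
        return ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
    }

    /* Proposed offset variant (hypothetical): add 0x80 so the 2-byte
       range starts right after the 1-byte range, making c0 80 -> U+0080
       and leaving no overlong 2-byte forms. */
    static uint32_t decode2_offset(uint8_t b0, uint8_t b1) {
        return (((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F)) + 0x80;
    }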
The siblings so far talk about the synchronizing nature of the indicators, but that's not relevant to your question. Your question is really:
Why is U+0080 encoded as c2 80 instead of c0 80, which is the lowest sequence after 7f?
I suspect the answer is:
a) the security impacts of overlong encodings were not contemplated; lots of fun to be had if one component accepts overlong encodings while another scans for patterns assuming only shortest encodings (see the sketch below)
b) UTF-8 as standardized allows encoding and decoding with bitmask and bitshift only; your proposed encoding would additionally require addition and subtraction
You can find a bit of email discussion from 1992 here [1]; at the very bottom there are some notes about what became UTF-8:
> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.
The FSS-UTF definition included right before that note does use additive constants.
[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
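To illustrate a) with a hypothetical sketch (not any real system's code): a byte-level filter that looks for '/' only in its shortest form never sees the overlong 0xc0 0xaf that a lenient decoder later turns back into '/':

    #include <stdio.h>
    #include <stddef.h>

    /* Byte-level filter: reject input containing '/' (0x2F). It assumes
       shortest-form UTF-8, so it never matches an overlong '/'. */
    static int contains_slash(const unsigned char *s, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (s[i] == 0x2F)
                return 1;
        return 0;
    }

    int main(void) {
        /* 0xC0 0xAF is the overlong 2-byte form of '/'; a decoder that
           accepts overlong sequences would decode it as a slash. */
        const unsigned char input[] = { '.', '.', 0xC0, 0xAF, 'e', 't', 'c' };
        printf("%d\n", contains_slash(input, sizeof input)); /* 0: missed */
        return 0;
    }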
A variation of a) is comparing strings as raw UTF-8 byte sequences while overlong encodings are also accepted (earlier and/or later in the pipeline). This leads to situations where strings that compare as unequal are actually equal in terms of code points.
Ehhh I view things slightly differently. Overlong encodings are per se illegal, so they cannot encode code points, even if a naive algorithm would consistently interpret them as such.
I get what you mean in terms of Postel's Law, e.g. software that is liberal in what it accepts should view 01001000 01100101 01101010 01101010 01101111 as equivalent to 11000001 10001000 11000001 10100101 11000001 10101010 11000001 10101010 11000001 10101111, despite the two sequences not being byte-for-byte identical. I'm just not convinced Postel's Law should be applied wrt UTF-8 code units.
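A minimal sketch of that mismatch, using a single 'A' as the example (hypothetical code, just to contrast the byte-level and code-point-level views):

    #include <stdio.h>

    int main(void) {
        /* U+0041 'A': shortest (legal) form vs an overlong 2-byte form. */
        const unsigned char shortest[] = { 0x41 };
        const unsigned char overlong[] = { 0xC1, 0x81 };

        /* Byte for byte the two strings differ (different lengths and
           different bytes), yet a decoder that tolerates overlong forms
           maps both to the same code point. */
        unsigned cp = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F);
        printf("overlong decodes to U+%04X, shortest is U+%04X\n",
               cp, shortest[0]);
        return 0;
    }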
Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.
I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security was not as big a thing then as it is now, so you're probably right.
I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in.] Without that property, UTF-8 loses self-synchronicity: the property that, given a truncated UTF-8 stream, you can always find the codepoint boundaries and will lose at most one codepoint's worth of data rather than having the whole stream garbled.
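As a minimal sketch of what that resynchronization looks like (a hypothetical helper, with no validation beyond the length check): after landing at an arbitrary offset, you just skip bytes matching 10xxxxxx until you hit a lead or ASCII byte:

    #include <stddef.h>

    /* Find the start of the next code point at or after pos. Continuation
       bytes all match 10xxxxxx, so they can be skipped unambiguously. */
    static size_t next_codepoint_start(const unsigned char *buf, size_t len,
                                       size_t pos) {
        while (pos < len && (buf[pos] & 0xC0) == 0x80)
            pos++;
        return pos; /* index of a lead/ASCII byte, or len if none remain */
    }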
In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you instead had to subtract the number of code points already covered by shorter sequences, you'd introduce additional arithmetic operations into both encoding and decoding.
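A minimal sketch of the 2-byte encode step under both schemes (the 0x80 offset is an assumption about how the proposal would be parameterized; names are illustrative):

    #include <stdint.h>

    /* Standard UTF-8, 2-byte case: pure mask and shift. */
    static void encode2_standard(uint32_t cp, uint8_t out[2]) {
        out[0] = 0xC0 | (uint8_t)(cp >> 6);   /* 110xxxxx */
        out[1] = 0x80 | (uint8_t)(cp & 0x3F); /* 10xxxxxx */
    }

    /* Offset variant: the encoder first has to subtract the 0x80 code
       points already covered by 1-byte sequences. */
    static void encode2_offset(uint32_t cp, uint8_t out[2]) {
        uint32_t v = cp - 0x80; /* the extra arithmetic step */
        out[0] = 0xC0 | (uint8_t)(v >> 6);
        out[1] = 0x80 | (uint8_t)(v & 0x3F);
    }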
See quectophoton's comment—the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser is jumping in at a random offset—or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early 90s, as transmission was much less reliable than it is today.
That would make the calculations more complicated and a little slower. Now you can do a few quick bit shifts. This was more of an issue back in the '90s when UTF-8 was designed and computers were slower.
https://en.m.wikipedia.org/wiki/Self-synchronizing_code
Because then you could no longer tell, just by looking at a byte, whether it is the beginning of a character, and being able to do so is a useful property of UTF-8.
I think that would garble random access?