If you look at the list, it’s primarily (but not entirely) about oddities in the characters’ UTF-8 encodings. Most of them appear to sit on a boundary where changing the case adds or removes bytes. That’s not really Unicode’s concern.
There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.
In another comment I said that a more accurate title would have been "Unicode codepoints that expand or contract when case is changed in UTF-8", which I think covers it well.
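For what it's worth, here is a minimal Python sketch of the two effects. The helper name `sizes` and the sample characters are my own picks, not taken from the list, and it relies on Python's built-in `str.upper()` / `str.lower()`, whose mappings may differ from other languages' in edge cases.

```python
def sizes(s: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))


# Encoding-level effect: still one code point, but the UTF-8 length changes.
dotless_i = "\u0131"        # LATIN SMALL LETTER DOTLESS I
a_stroke = "\u023a"         # LATIN CAPITAL LETTER A WITH STROKE
print(sizes(dotless_i), sizes(dotless_i.upper()))  # (1, 2) (1, 1)
print(sizes(a_stroke), sizes(a_stroke.lower()))    # (1, 2) (1, 3)

# Unicode-level effect: one code point maps to a sequence of two.
sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
dotted_I = "\u0130"         # LATIN CAPITAL LETTER I WITH DOT ABOVE
print(sizes(sharp_s), sizes(sharp_s.upper()))      # (1, 2) (2, 2)
print(sizes(dotted_I), sizes(dotted_I.lower()))    # (1, 2) (2, 3)
```

The first pair only looks odd if you count UTF-8 bytes; the second pair changes the code point count in any encoding.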
UTF-8 is simply an encoding; "UTF-8 characters" is just not a correct use of language. It's like saying "binary number": a number has the same value regardless of the base you use to write it, and the base is a scheme for representing it, not a system for defining what a number is. This is a common imprecision in language which I have seen cause serious difficulties in learning concepts properly.
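A small sketch of that distinction in Python, using only the standard library (the euro sign and the number 42 are arbitrary examples I picked): the code point is one abstract value, and each encoding is just a different way of writing it down.

```python
# One number, several notations: the value does not depend on the base.
n = 42
assert int("101010", 2) == int("52", 8) == int("2a", 16) == n

# One code point, several encodings: the character does not depend on the encoding.
ch = "\u20ac"                      # EURO SIGN, code point U+20AC
utf8 = ch.encode("utf-8")          # 3 bytes: e2 82 ac
utf16 = ch.encode("utf-16-le")     # 2 bytes: ac 20
utf32 = ch.encode("utf-32-le")     # 4 bytes: ac 20 00 00

# Decoding any of them gives back the same single code point.
decoded = {
    utf8.decode("utf-8"),
    utf16.decode("utf-16-le"),
    utf32.decode("utf-32-le"),
}
print(decoded == {ch}, hex(ord(ch)))  # True 0x20ac
```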
"Unicode codepoint sequences whose codepoint lengths and/or UTF-8 code-unit lengths behave oddly when you change their case" would not fit in an HN title, however.
The byte changes listed are for the UTF-8 encoding, though, so it's about UTF-8 in that sense.
It's both.