Comment by rmrfchik

8 months ago

It's not UTF-8 characters but Unicode.

7 comments

rmrfchik

If you look at the list, it’s primarily (but not completely) about oddities in their UTF-8 encoding. Most of them appear to be on the boundary of adding additional bytes when the case is changed. That’s not really Unicode’s concern.

There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.
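For anyone who wants to see both effects side by side, here is a minimal Python sketch. It assumes Python's built-in `str.upper()` / `str.lower()` as the case-mapping implementation (other languages may map some of these differently), and the helper names `utf8_len` and `case_change` are made up for illustration.

```python
def utf8_len(s: str) -> int:
    """Number of bytes (UTF-8 code units) needed to encode s."""
    return len(s.encode("utf-8"))


def case_change(s: str, to_upper: bool = True) -> tuple:
    """Return (codepoints, utf8 bytes) before and after a case change.

    The result is (cp_before, bytes_before, cp_after, bytes_after).
    """
    changed = s.upper() if to_upper else s.lower()
    return (len(s), utf8_len(s), len(changed), utf8_len(changed))


if __name__ == "__main__":
    samples = [
        ("\u0131", True),   # dotless i: byte count shrinks when uppercased
        ("\u00df", True),   # sharp s: one codepoint becomes two
        ("\u0130", False),  # dotted capital I: lowercases to i + combining dot
    ]
    for ch, to_upper in samples:
        print(f"U+{ord(ch):04X}", case_change(ch, to_upper))
```

The first sample changes only in byte length, which is an encoding-level effect; the other two change in codepoint count, which happens at the Unicode level regardless of encoding.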

Rendello 8 months ago

In another comment I said that a more accurate title would have been "Unicode codepoints that expand or contract when case is changed in UTF-8", which I think covers it well.

Aardwolf 8 months ago

The byte changes listed are for the UTF-8 encoding though, so it's about UTF-8 in that sense.

Retr0id 8 months ago

It's both.

zahlman 8 months ago

UTF-8 is simply an encoding; "UTF-8 characters" is just not a correct use of language. It's like saying "binary number": a number has the same value regardless of the base you use to write it, and the base is a scheme for representing it, not a system for defining what a number is. This is a common imprecision in language which I have seen cause serious difficulties in learning concepts properly.
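The character-versus-encoding distinction can be sketched in a few lines of Python. This is only an illustration: the helper name `encoded_forms` and the choice of three encodings are assumptions, not anything from the linked article.

```python
# One abstract character (a codepoint) can be written out as different
# byte sequences, just as one number can be written in different bases.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_forms(ch: str) -> dict:
    """Map each encoding name to the bytes it produces for ch."""
    return {enc: ch.encode(enc) for enc in ENCODINGS}


if __name__ == "__main__":
    ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
    for enc, raw in encoded_forms(ch).items():
        # Different bytes each time, but decoding always yields the same character.
        print(enc, raw.hex(), raw.decode(enc) == ch)

    # The number analogy: same value, three notations.
    print(int("255", 10) == int("ff", 16) == int("11111111", 2))
```

The codepoint U+00E9 is the "number"; UTF-8, UTF-16, and UTF-32 are the "bases" it can be written in.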
- Retr0id 8 months ago
  
  "Unicode codepoint sequences whose codepoint lengths and/or UTF-8 code-unit lengths behave oddly when you change their case" would not fit in an HN title, however.
  