Comment by Retr0id

6 months ago

It's both.

UTF-8 is simply an encoding, so "UTF-8 characters" is not a correct use of language. It's like saying "binary number": a number has the same value regardless of the base you write it in, and the base is a scheme for representing the number, not a system for defining what a number is. This is a common imprecision, and I have seen it cause serious difficulties for people trying to learn the concepts properly.
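To make the distinction concrete, here's a rough Python sketch. The sample character and the variable names are my own picks for illustration, nothing from the Gist: the same codepoint comes out as different bytes under different encodings, the same way one number comes out as different digit strings in different bases.

```python
# One abstract character: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
# The choice of character is arbitrary, purely for illustration.
text = "\u00e9"

# The codepoint is a property of the character, not of any encoding.
codepoint = ord(text)

# Three encodings give three different byte sequences for the same codepoint.
as_utf8 = text.encode("utf-8")       # 2 bytes
as_utf16 = text.encode("utf-16-le")  # 2 bytes, different values
as_utf32 = text.encode("utf-32-le")  # 4 bytes

# Decoding each one gets back the identical string.
round_trips = (
    as_utf8.decode("utf-8"),
    as_utf16.decode("utf-16-le"),
    as_utf32.decode("utf-32-le"),
)

# The number analogy: one value, several written forms.
five_from_binary = int("101", 2)
five_from_decimal = int("5", 10)

print(hex(codepoint), as_utf8, as_utf16, as_utf32)
print(five_from_binary == five_from_decimal)
```

So "UTF-8" describes the bytes, while "character" (or more precisely, codepoint) describes the thing being encoded.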

  • "Unicode codepoint sequences whose codepoint lengths and/or UTF-8 code unit lengths behave oddly when you change their case" would not fit in an HN title, however.

    • I (OP) said above that "Unicode codepoints that expand or contract when case is changed in UTF-8" would have worked fine, and I've changed the Gist title to that in any case. I'm curious whether it would have affected the attention it received on HN.
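For anyone wondering what "expand or contract" means in practice, here is a minimal sketch using Python's built-in `str.upper()` / `str.lower()`. The helper name `measure` and the sample list are hypothetical, chosen for illustration rather than taken from the Gist, and the results depend on the Unicode case mappings shipped with your Python version.

```python
def measure(s):
    """Return (number of codepoints, number of UTF-8 code units/bytes)."""
    return len(s), len(s.encode("utf-8"))

# (character, case operation) pairs; a hand-picked, non-exhaustive list.
samples = [
    ("\u00df", str.upper),  # sharp s: uppercases to "SS", 1 codepoint -> 2
    ("\u0130", str.lower),  # capital I with dot above: gains a combining dot
    ("\u2c65", str.upper),  # small a with stroke: 3 UTF-8 bytes -> 2
    ("\u212a", str.lower),  # Kelvin sign: 3 UTF-8 bytes -> 1
]

for original, change_case in samples:
    changed = change_case(original)
    print(
        f"U+{ord(original):04X}: "
        f"{measure(original)} -> {measure(changed)} (codepoints, UTF-8 bytes)"
    )
```

Note that the codepoint count and the UTF-8 byte count can change independently: the sharp s grows in codepoints while staying at 2 bytes, and the Kelvin sign stays at one codepoint while shrinking in bytes.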