UTF-8 characters that behave oddly when the case is changed

6 months ago (gist.github.com)

I generated this list from a Python script I wrote a few months back for use in property tests in a Rust codebase. It's meant to break parsers that make bad assumptions about UTF-8, like assuming that upper- or lowercasing a character will always produce an encoding of the same size in bytes, or even that it will produce a single character.

The "ﬀ" ligature (U+FB00), for example, is uppercased by Python as "FF", meaning it both becomes two separate characters and shrinks from three bytes to two in UTF-8. I hope it's interesting.
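
A minimal Python sketch of the ligature example, for anyone who wants to check the numbers. It uses only built-in `str` methods; the variable names are just for illustration.

```python
# U+FB00 LATIN SMALL LIGATURE FF, written as an escape so it survives copy/paste.
ligature = "\ufb00"
uppercased = ligature.upper()

# Uppercasing yields two separate ASCII letters.
print(repr(uppercased))

# Code point counts before and after.
print(len(ligature), len(uppercased))

# UTF-8 byte counts before and after.
print(len(ligature.encode("utf-8")), len(uppercased.encode("utf-8")))
```

So the code point count grows while the byte count shrinks, which is exactly the combination a parser with fixed-size assumptions tends to miss.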

  • Thanks for the insight. I had never considered this, even though I have researched quite a few oddities in UTF-8 parsing myself over the years. It's the gift that keeps on giving when it comes to ways of breaking software, I find. Time to go over my code again.

  • The implication of this is that there are also "roundtrip-unsafe" characters, i.e. flip_case(flip_case(x)) != x, right?

    Not sure I wanted to know...

    • Indeed, the parent already gives one: flip_case(flip_case("ﬀ")) = "ff". (Since it's hard to tell with what I guess is default ligature formation, at least in my browser, the first is the 'ﬀ' ligature, U+FB00, and the second is two 'f's.)

    • > Not sure I wanted to know...

      Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.

      Bruce Schneier in 2000 on Unicode security risks:

      https://www.schneier.com/crypto-gram/archives/2000/0715.html...

      Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.

      My favorite line from Schneier: "Unicode is just too complex to ever be secure".

      And, no matter if you love Unicode or not, there's lots of wisdom in there.

      When design-by-committee gives birth to something way too complex, insecurity is never far behind.
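
The round-trip question from this subthread is easy to check directly. A small sketch, assuming that applying `flip_case` twice amounts to uppercasing and then lowercasing; `round_trip` is a hypothetical helper name, not something defined in the thread or in Python.

```python
def round_trip(text: str) -> str:
    """Uppercase, then lowercase again (one reading of flip_case applied twice)."""
    return text.upper().lower()


ligature = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF
result = round_trip(ligature)

# The ligature does not come back: the result is two plain ASCII letters.
print(result == ligature)
print([hex(ord(c)) for c in result])

# Plain ASCII survives the round trip unchanged.
print(round_trip("abc") == "abc")
```

The uppercase form "FF" carries no record that it came from a ligature, so lowercasing has nothing to restore.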

This doesn't include the oddest of all: sigma.

Lowercasing Σ (U+03A3 Greek capital letter sigma) is context-sensitive: it becomes ς (U+03C2 Greek small letter final sigma) at the end of a word and σ (U+03C3 Greek small letter sigma) elsewhere.
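
CPython's `str.lower()` applies this final-sigma rule to whole strings, so the context sensitivity is easy to observe. A small sketch, with the Greek written as escapes; the sample word ΟΔΟΣ is my own choice, not from the list.

```python
SIGMA = "\u03a3"        # Σ GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"  # σ GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA

word = "\u039f\u0394\u039f" + SIGMA  # ΟΔΟΣ

# Lowercasing the whole word sees the context and picks the final form.
whole = word.lower()
print(whole[-1] == FINAL_SIGMA)

# Lowercasing character by character loses the context.
per_char = "".join(c.lower() for c in word)
print(per_char[-1] == SMALL_SIGMA)

# Both lowercase forms map back to the same capital letter.
print(FINAL_SIGMA.upper() == SMALL_SIGMA.upper() == SIGMA)
```

No byte sizes change here (all three sigmas are two bytes in UTF-8), which is presumably why it would not show up in a list built from size differences.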

  • That reminds me of the old 'long S'[1] that used to exist in English and survives in the symbol for integration. It worked in a ſimilar way for writing Engliſh: at the ſtart and middle of words you'd use the long s, but not at the end, so you end up with 'poſſeſs' for 'possess'. There were other rules around it too; I think you'd always use the usual S for a capital.

    [1]https://en.wikipedia.org/wiki/Long_s

    • Not only in English. My local newspaper (in Trondheim, Norway) shows its name as Adresſeaviſen on the front page (in a Fraktur font to boot).

  • True! This list could more accurately be described as "Unicode codepoints that expand or contract when case is changed in UTF-8", which is exactly what I was testing in my program. I had built a parser that relied on some assumptions that I felt were not correct, so I built some tests with this data.

    For those interested, this was the generation script. I'm sure there was a way to do it better or simpler, and I wish I could just say this was a quick-and-dirty script, but in fact I spent quite a few hours on it (this is the fourth rewrite):

    https://gist.github.com/rendello/b06ca3d976d26fa011897bd1603...
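
For readers who just want the idea without opening the gist: a rough sketch of how such a list could be produced. This is not the author's script, only an assumption about the general approach, and `size_changing_codepoints` is a hypothetical name. The results depend on the Unicode version bundled with the Python interpreter.

```python
import sys


def size_changing_codepoints():
    """Yield (char, mapped, operation) where a case mapping changes the UTF-8 length."""
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        char = chr(cp)
        original_size = len(char.encode("utf-8"))
        for operation, mapped in (("upper", char.upper()), ("lower", char.lower())):
            if len(mapped.encode("utf-8")) != original_size:
                yield char, mapped, operation


hits = list(size_changing_codepoints())

# Show a few entries as code points and byte sizes.
for char, mapped, operation in hits[:5]:
    print(f"U+{ord(char):04X} {operation}: "
          f"{len(char.encode('utf-8'))} -> {len(mapped.encode('utf-8'))} bytes")
```

Scanning every code point takes a second or two, which is acceptable for generating test data once.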

In one of my work projects it was the Turkish İ that gave us trouble. In some case-insensitive text searching code, we matched the lowercase query against the lowercase text, and had to handle cases like that specially to avoid reporting the wrong matching span in the original text, since the lowercase string would have a different length than the uppercase string. This was one of my first real-world projects and opened my eyes a bit to the importance of specifications and standards.
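
One way to handle this, sketched in Python under my own assumptions (this is not the code from that project, and `find_span` is a hypothetical name): lowercase the text one character at a time while recording which original index each lowercased character came from, then map the match back through that table.

```python
from typing import Optional, Tuple


def find_span(text: str, query: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices into `text` of the first case-insensitive match."""
    lowered_parts = []
    origin = []  # origin[i] is the index in `text` that produced lowered character i
    for index, char in enumerate(text):
        low = char.lower()
        lowered_parts.append(low)
        origin.extend([index] * len(low))
    lowered = "".join(lowered_parts)

    needle = "".join(char.lower() for char in query)
    if not needle:
        return None
    pos = lowered.find(needle)
    if pos == -1:
        return None
    return origin[pos], origin[pos + len(needle) - 1] + 1


text = "\u0130stanbul'da kar"  # starts with U+0130, the dotted capital I
span = find_span(text, "KAR")
print(span, repr(text[span[0]:span[1]]))

# The naive approach reports an offset shifted by one, because U+0130
# lowercases to two code points ("i" plus a combining dot above).
naive = text.lower().find("kar")
print(naive, repr(text[naive:naive + 3]))
```

Lowercasing per character keeps the index table simple, but it gives up context-sensitive rules such as the final sigma, and it does not handle a match that begins in the middle of an expanded character.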

This isn't "odd" behavior. It's a consequence of using a multibyte encoding scheme. Also, when dealing with case mapping, you can't assume that the character count will remain constant. This is because in Unicode, full case mappings can map one character to multiple characters, meaning you might end up with more characters than you started with, regardless of the encoding used.
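
A quick way to see the encoding-independent part, sketched in Python. The character ß (U+00DF) is my own choice of example, not one taken from the thread, and `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in three encodings (little-endian forms, so no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


sharp_s = "\u00df"         # ß LATIN SMALL LETTER SHARP S
upper = sharp_s.upper()    # full case mapping gives two letters

print(repr(upper), len(sharp_s), len(upper))
print(encoded_sizes(sharp_s))
print(encoded_sizes(upper))
```

Here the code point count doubles in every encoding, yet the UTF-8 byte length happens to stay the same, so a check based only on UTF-8 sizes would not flag this character.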

It's not UTF-8 characters but Unicode.

  • If you look at the list, it’s primarily (but not completely) about oddities in their UTF-8 encoding. Most of them appear to be on the boundary of adding additional bytes when the case is changed. That’s not really Unicode’s concern.

    There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.

  • In another comment I said that a more accurate title would have been "Unicode codepoints that expand or contract when case is changed in UTF-8", which I think covers it well.

  • The byte-changes listed are for the UTF-8 encoding though, so it's about UTF-8 in that sense

  • It's both.

    • UTF-8 is simply an encoding; "UTF-8 characters" is just not a correct use of language. It's like, say, "binary number": a number has the same value regardless of the base you use to write it, and the base is a scheme for representing it, not a system for defining what a number is. This is a common imprecision in language which I have seen cause serious difficulties in learning concepts properly.
