UTF-8 characters that behave oddly when the case is changed

9 months ago (gist.github.com)

65 comments

Rendello

I generated this list from a Python script I wrote a few months back for use in property tests in a Rust codebase. It's meant to break parsers that make bad assumptions about UTF-8, like assuming that upper- or lowercasing a character will always result in the character encoding having the same size in bytes, or even that it will result in one character.

The "ﬀ" ligature, for example, is uppercased by Python as "FF", meaning it both becomes two separate characters, and is one byte smaller overall. I hope it's interesting.
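
Both effects can be checked by comparing the code point count and the UTF-8 byte count before and after uppercasing. This is only an illustrative sketch; the helper name `case_sizes` is made up for this example and does not come from the gist.

```python
def case_sizes(s):
    """Return (code points, UTF-8 bytes) for s and for s.upper()."""
    upper = s.upper()
    return (
        (len(s), len(s.encode("utf-8"))),
        (len(upper), len(upper.encode("utf-8"))),
    )

before, after = case_sizes("\ufb00")  # U+FB00 LATIN SMALL LIGATURE FF
print(before)  # (1, 3): one code point, three bytes
print(after)   # (2, 2): "FF" is two code points, two bytes
```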

slome 9 months ago

Thanks for the insight. I had never considered this even though I researched quite some oddities in UTF-8 parsing myself over the years. It's the gift that keeps on giving when it comes to ways to break things in software, I find. Time to go over my code again.
xg15 9 months ago
The implication of this is that there are also "roundtrip-unsafe" characters, i.e. flip_case(flip_case(x)) != x, right?
Not sure I wanted to know...
- LegionMammal978 9 months ago
  
  A standard example here is the Turkish dotless I, which yields "ı" → "I" → "i" with most case-conversion libraries.
  
- Rendello 9 months ago
  
  I know Halloween was yesterday but let's discover this horror together with some terrifying Python[1]! Turns out, yep.
  For upper → lower → upper we have:
  Ω ω Ω
  İ i̇ İ
  K k K
  Å å Å
  ẞ ß SS
  ϴ θ Θ
  For lower → upper → lower there are a lot more.
  1. https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
  
- JadeNB 9 months ago
  
  Indeed, the parent already gives one: flip_case(flip_case("ﬀ")) = "ff". (Since it's hard to tell with what I guess is default ligature formation, at least in my browser, the first is an 'ﬀ' ligature and the second is two 'f's.)
- automatic6131 9 months ago
  
  > flip_case(flip_case(x)) == x
  Falsehoods programmers believe about strings...
- TacticalCoder 9 months ago
  
  > Not sure I wanted to know...
  Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.
  Bruce Schneier in 2000 on Unicode security risks:
  https://www.schneier.com/crypto-gram/archives/2000/0715.html...
  Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.
  My favorite line from Schneier: "Unicode is just too complex to ever be secure".
  And, no matter if you love Unicode or not, there's lots of wisdom in there.
  When design-by-committee gives birth to something way too complex, insecurity is never far behind.
  
zahlman 9 months ago
Another assumption worth testing is that casing round-trips:
>>> 'ẞ'.lower().upper()
'SS'
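
To look for more such cases, one could scan every code point and keep those where lowercasing and then uppercasing does not return the original. A minimal sketch, assuming Python's built-in `str.lower()` and `str.upper()` (whose results depend on the Unicode version bundled with the interpreter); `upper_roundtrip_failures` is a hypothetical helper name:

```python
import sys

def upper_roundtrip_failures():
    """Code points with a lowercase mapping where lower().upper() differs."""
    failures = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        ch = chr(cp)
        lowered = ch.lower()
        # Only consider characters that actually change when lowercased.
        if lowered != ch and lowered.upper() != ch:
            failures.append((ch, lowered, lowered.upper()))
    return failures

for ch, lowered, back in upper_roundtrip_failures():
    print(f"U+{ord(ch):04X} {ch!r} -> {lowered!r} -> {back!r}")
```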
- Rendello 9 months ago
  
  I ended up playing with just that in another thread:
  https://news.ycombinator.com/item?id=42020476

LukeShu 9 months ago

This doesn't include the oddest of all: sigma.

When lowercasing Σ (U+03A3 Greek capital letter sigma) it is context-sensitive (on whether it is at the end of a word) whether it becomes σ (U+03C3 Greek small letter sigma) or ς (U+03C2 Greek small letter final sigma).
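
Python's `str.lower()` applies this final-sigma rule, so the same capital letter lowercases differently depending on its position. A small sketch (the Greek word is only an example input):

```python
word = "\u039f\u0394\u039f\u03a3"  # "ΟΔΟΣ", ending in U+03A3 capital sigma

print(word.lower())    # 'οδος', ending in U+03C2 (final sigma)
print("\u03a3".lower())  # 'σ' (U+03C3): a lone sigma is not treated as word-final
print(word.lower().upper() == word)  # True: both small sigmas uppercase to Σ
```

Note that lowercasing a string one character at a time loses this context, so it can give a different result from lowercasing the whole string.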

BoxOfRain 9 months ago
That reminds me of the old 'long S'[1] that used to exist in English and survives in the symbol for integration. That worked in a ſimilar way for writing Engliſh: at the ſtart and middle of words you'd use the long s but not at the end, so you end up with 'poſſeſs' for 'possess'. There were other rules around it too; I think you'd always use the usual S for a capital.
[1]https://en.wikipedia.org/wiki/Long_s
- hanche 9 months ago
  
  Not only in English. My local newspaper (in Trondheim, Norway) shows its name as Adresſeaviſen on the front page (in a Fraktur font to boot).
Rendello 9 months ago

True! This list could more accurately be described as "Unicode codepoints that expand or contract when case is changed in UTF-8", which is exactly what I was testing in my program. I had built a parser that was relying on some assumptions that I felt were not correct, so I built some tests with this data.
For those interested, this was the generation script. I'm sure there was a way to do it better or simpler, and I wish I could just say this was a quick-and-dirty script, but in fact I spent quite a few hours on it (this is the fourth rewrite):
https://gist.github.com/rendello/b06ca3d976d26fa011897bd1603...
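
For readers who do not want to open the gist, the core idea can be sketched in a few lines. This is not the linked script, just a minimal reconstruction of the approach described above, and `size_changing_code_points` is a hypothetical name:

```python
import sys

def utf8_len(s):
    """Length of s in bytes when encoded as UTF-8."""
    return len(s.encode("utf-8"))

def size_changing_code_points():
    """Yield (char, mapped, op) when a case change alters the UTF-8 byte length."""
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        for op, mapped in (("upper", ch.upper()), ("lower", ch.lower())):
            if utf8_len(mapped) != utf8_len(ch):
                yield ch, mapped, op

for ch, mapped, op in size_changing_code_points():
    print(f"U+{ord(ch):04X} {op}: {ch!r} ({utf8_len(ch)} bytes) "
          f"-> {mapped!r} ({utf8_len(mapped)} bytes)")
```

The exact list depends on the Unicode version bundled with the Python interpreter.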
Rendello 9 months ago

Σ now shows up on my Unicode round-trip horror show ;)
https://news.ycombinator.com/item?id=42020476

throwaway173920 9 months ago

In one of my work projects it was the Turkish İ that gave us trouble. In some case-insensitive text searching code, we matched the lowercase query against the lowercase text, and had to handle cases like that specially to avoid reporting the wrong matching span in the original text, since the lowercase string would have a different length than the uppercase string. This was one of my first real-world projects and opened my eyes a bit to the importance of specifications and standards.
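
One way to handle this, sketched here under assumptions and not taken from the project described above, is to lowercase the text one character at a time while recording which original index produced each lowered character. The function name `find_case_insensitive` is hypothetical.

```python
def find_case_insensitive(text, query):
    """Return the (start, end) span in the ORIGINAL text, or None if not found."""
    needle = query.lower()
    if not needle:
        return None
    lowered_parts = []
    origin = []  # origin[i] = index in text of the char that produced lowered[i]
    for idx, ch in enumerate(text):
        low = ch.lower()  # may be longer than one character, e.g. for U+0130
        lowered_parts.append(low)
        origin.extend([idx] * len(low))
    lowered = "".join(lowered_parts)
    pos = lowered.find(needle)
    if pos == -1:
        return None
    last = pos + len(needle) - 1
    return origin[pos], origin[last] + 1

text = "\u0130stanbul is big"  # starts with U+0130, capital I with dot above
span = find_case_insensitive(text, "IS")
print(span, repr(text[span[0]:span[1]]))  # (9, 11) 'is'
print(text.lower().find("is"))            # 10: off by one for the original text
```

Lowercasing character by character ignores context-sensitive rules such as final sigma, so this is a simplification; a production search would more likely rely on case folding plus normalization.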

pavel_lishin 9 months ago
Can't mention the Turkish case situation without mentioning the actual murder that took place because of it: https://languagelog.ldc.upenn.edu/nll/?p=73
- Filligree 9 months ago
  
  The murder is a tragedy, of course, but I would hesitate to blame the cellphone. There’s overreactions, and then there’s… this.
  
johannes1234321 9 months ago

In PHP the Turkish locale caused quite some trouble. In some situations a different locale was used for compiling and for runtime while handling "case-insensitive" identifiers, so that sometimes names with an "I" could not be found anymore.
Rendello 9 months ago

I had this exact bug with the same character:
https://github.com/rendello/layout/issues/8#issuecomment-235...

D-Coder 9 months ago

Raymond Chen's "Old New Thing" blog just commented on a similar issue: What has case distinction but is neither uppercase nor lowercase?

https://devblogs.microsoft.com/oldnewthing/20241031-00/?p=11...

hgs3 9 months ago

This isn't "odd" behavior. It's a consequence of using a multibyte encoding scheme. Also, when dealing with case mapping, you can't assume that the character count will remain constant. This is because in Unicode full case mappings can map a character to multiple characters, meaning you might end up with more characters than you started with, regardless of the encoding used.
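
The encoding-independent part is easy to check: "ß" uppercases to two code points, and only the byte counts differ by encoding. A short sketch in Python:

```python
s = "\u00df"       # "ß", LATIN SMALL LETTER SHARP S
upper = s.upper()  # the full case mapping gives "SS"

print(len(s), len(upper))  # 1 2: the code point count changes
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, len(s.encode(encoding)), len(upper.encode(encoding)))
# utf-8 2 2
# utf-16-le 2 4
# utf-32-le 4 8
```

In this particular case the UTF-8 byte length happens to stay the same even though the character count grows.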

Rendello 9 months ago

That's exactly right. My comment here is related:
https://news.ycombinator.com/item?id=42018937

rmrfchik 9 months ago

It's not UTF-8 characters but Unicode.

jonhohle 9 months ago

If you look at the list, it’s primarily (but not completely) about oddities in their UTF-8 encoding. Most of them appear to be on the boundary of adding additional bytes when the case is changed. That’s not really Unicode’s concern.
There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.
Rendello 9 months ago

In another comment I said that a more accurate title would have been "Unicode codepoints that expand or contract when case is changed in UTF-8", which I think covers it well.
Aardwolf 9 months ago

The byte-changes listed are for the UTF-8 encoding though, so it's about UTF-8 in that sense
Retr0id 9 months ago
It's both.
- zahlman 9 months ago
  
  UTF-8 is simply an encoding; "UTF-8 characters" is just not correct use of language. Just like, say, "binary number"; a number has the same value regardless of the base you use to write it, and the base is a scheme for representing it, not a system for defining what a number is. This is a common imprecision in language which I have seen cause serious difficulties in learning concepts properly.
  

layer8 9 months ago

The canonical source data for this is https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, by the way.
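
In Python, Unicode case folding is available as `str.casefold()`, which is generally a safer basis for caseless comparison than `lower()` or `upper()`. A brief sketch; `caseless_equal` is a hypothetical helper, and results depend on the Unicode version the interpreter ships with:

```python
def caseless_equal(a, b):
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()

print("\u00df".casefold())                    # 'ss'
print(caseless_equal("Straße", "STRASSE"))    # True
print("Straße".lower() == "STRASSE".lower())  # False
print(caseless_equal("\ufb00", "FF"))         # True
```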

zzo38computer 9 months ago

I wrote another comment relating to case folding: https://news.ycombinator.com/item?id=41784627