Comment by LukeShu
8 months ago
This doesn't include the oddest of all: sigma.
Lowercasing Σ (U+03A3 Greek capital letter sigma) is context-sensitive: it becomes ς (U+03C2 Greek small letter final sigma) at the end of a word and σ (U+03C3 Greek small letter sigma) everywhere else.
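For anyone who wants to poke at this, here is a small sketch in Python, whose built-in `str.lower()` applies the final-sigma rule as far as I know. The sample words are just ones I picked for illustration, and other languages or libraries may behave differently.

```python
# Sketch: context-sensitive lowercasing of capital sigma in Python 3.
# The sample words are illustrative choices, not taken from the list above.

word_final = "ΟΔΟΣ".lower()    # sigma at the end of a word
word_initial = "ΣΟΦΙΑ".lower()  # sigma at the start of a word
lone = "Σ".lower()              # a lone sigma has no preceding letter

# Round trip: final sigma uppercases to capital sigma, which (standing
# alone) lowercases to the non-final form, so the original is lost.
round_trip = "ς".upper().lower()

for label, value in [
    ("final", word_final),
    ("initial", word_initial),
    ("lone", lone),
    ("round trip of U+03C2", round_trip),
]:
    # Print each result together with its code points for inspection.
    print(label, value, [f"U+{ord(c):04X}" for c in value])
```

The round-trip line is the part that tends to bite: upper-then-lower does not give back the character you started with.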
That reminds me of the old 'long S'[1] that used to exist in English and survives in the symbol for integration. It worked in a ſimilar way for writing Engliſh: at the ſtart and middle of words you'd use the long s, but not at the end, so you end up with 'poſſeſs' for 'possess'. There were other rules around it too; I think you'd always use the usual S for a capital.
[1]https://en.wikipedia.org/wiki/Long_s
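A rough sketch of just the simplified rule described above (long s everywhere except at the end of a word, capitals left alone), in Python. The function name `to_long_s` is made up for this example, and the real historical conventions had more exceptions than this handles.

```python
import re

def to_long_s(text: str) -> str:
    """Replace lowercase 's' with long s (U+017F) unless it ends a word.

    Simplified assumption: an 's' is non-final if a letter follows it.
    Capital 'S' is never touched.
    """
    # [^\W\d_] matches any letter (a word character that is not a digit
    # or underscore); the lookahead leaves that letter unconsumed.
    return re.sub(r"s(?=[^\W\d_])", "\u017f", text)

print(to_long_s("possess"))
print(to_long_s("English is still English"))

# Long s also has a case round-trip problem: it uppercases to a plain
# 'S', which lowercases to the ordinary round 's'.
print("\u017f".upper(), "\u017f".upper().lower())
```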
Not only in English. My local newspaper (in Trondheim, Norway) shows its name as Adresſeaviſen on the front page (in a Fraktur font to boot).
True! This list could more accurately be described as "Unicode codepoints that expand or contract when case is changed in UTF-8", which is exactly what I was testing in my program. I had built a parser that relied on some assumptions that I felt were not correct, so I built some tests with this data.
For those interested, this was the generation script. I'm sure there was a way to do it better or simpler, and I wish I could just say this was a quick-and-dirty script, but in fact I spent quite a few hours on it (this is the fourth rewrite):
https://gist.github.com/rendello/b06ca3d976d26fa011897bd1603...
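For comparison, here is a minimal sketch of the same idea in Python. This is not the linked script, just one way to find code points whose UTF-8 byte length changes under case conversion. It relies on Python's built-in `str.lower()` and `str.upper()`, so the results depend on the Unicode version your interpreter ships with, and the helper names are my own.

```python
def utf8_len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

def case_length_changes():
    """Yield (char, lowered, uppered) for every code point whose UTF-8
    byte length differs after lower() or upper()."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        lowered, uppered = ch.lower(), ch.upper()
        size = utf8_len(ch)
        if utf8_len(lowered) != size or utf8_len(uppered) != size:
            yield ch, lowered, uppered

if __name__ == "__main__":
    for ch, lowered, uppered in case_length_changes():
        print(
            f"U+{ord(ch):04X} {ch!r}: {utf8_len(ch)} bytes, "
            f"lower {lowered!r} = {utf8_len(lowered)} bytes, "
            f"upper {uppered!r} = {utf8_len(uppered)} bytes"
        )
```

Note that this only looks at single code points in isolation, so context-sensitive cases such as final sigma will not show up here.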
Σ now shows up on my Unicode round-trip horror show ;)
https://news.ycombinator.com/item?id=42020476