Comment by LukeShu

8 months ago

This doesn't include the oddest of all: sigma.

Lowercasing Σ (U+03A3 Greek capital letter sigma) is context-sensitive: it becomes ς (U+03C2 Greek small letter final sigma) at the end of a word and σ (U+03C3 Greek small letter sigma) everywhere else.
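As a rough illustration, here is how this looks in Python, assuming CPython 3.3 or later, whose `str.lower()` applies the final-sigma rule. The sample words are only test inputs I picked, written as escapes so the exact codepoints are unambiguous:

```python
# Sample inputs (illustrative only), spelled with escapes:
#   "\u039f\u0394\u039f\u03a3"        is the uppercase word ΟΔΟΣ
#   "\u03a3\u039f\u03a6\u039f\u03a3"  is the uppercase word ΣΟΦΟΣ
#   "\u03a3"                          is a lone capital sigma
words = ["\u039f\u0394\u039f\u03a3", "\u03a3\u039f\u03a6\u039f\u03a3", "\u03a3"]

# str.lower() picks U+03C2 (final sigma) only when the sigma ends a word
# and is preceded by another cased letter; otherwise it picks U+03C3.
lowered = [w.lower() for w in words]

for original, result in zip(words, lowered):
    codepoints = " ".join(f"U+{ord(c):04X}" for c in result)
    print(f"{original} -> {result}  ({codepoints})")
```

Both lowercase forms uppercase back to U+03A3, so the distinction is lost once text has been uppercased.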

BoxOfRain 8 months ago

That reminds me of the old 'long S'[1] that used to exist in English and survives in the symbol for integration. It worked in a ſimilar way for writing Engliſh: at the ſtart and in the middle of words you'd use the long s, but not at the end, so you end up with 'poſſeſs' for 'possess'. There were other rules around it too; I think you'd always use the usual S for a capital.

[1]https://en.wikipedia.org/wiki/Long_s

hanche 8 months ago

Not only in English. My local newspaper (in Trondheim, Norway) shows its name as Adresſeaviſen on the front page (in a Fraktur font to boot).

Rendello 8 months ago

True! This list could more accurately be described as "Unicode codepoints that expand or contract when case is changed in UTF-8", which is exactly what I was testing in my program. I had built a parser that relied on some assumptions that I felt were not correct, so I built some tests with this data.

For those interested, this was the generation script. I'm sure there was a way to do it better or simpler, and I wish I could just say this was a quick-and-dirty script, but in fact I spent quite a few hours on it (this is the fourth rewrite):

https://gist.github.com/rendello/b06ca3d976d26fa011897bd1603...
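Separately from the gist, and not taken from it, here is a minimal Python sketch of the same idea: scan every codepoint and record the ones whose UTF-8 byte length changes under `str.lower()` or `str.upper()`. The names `utf8_len` and `case_length_changes` are made up for this example, and the exact results depend on the Unicode version bundled with the Python build.

```python
def utf8_len(s: str) -> int:
    """Number of bytes needed to encode s as UTF-8."""
    return len(s.encode("utf-8"))


def case_length_changes() -> dict:
    """Map (codepoint, operation) to (bytes_before, bytes_after)
    for every codepoint whose UTF-8 length changes under case mapping."""
    changes = {}
    for cp in range(0x110000):
        # Surrogates cannot be encoded as UTF-8, so skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        before = utf8_len(ch)
        for name, mapped in (("lower", ch.lower()), ("upper", ch.upper())):
            after = utf8_len(mapped)
            if after != before:
                changes[(cp, name)] = (before, after)
    return changes


changes = case_length_changes()
for (cp, name), (before, after) in sorted(changes.items())[:10]:
    print(f"U+{cp:04X} {name}: {before} -> {after} bytes")
```

For example, U+212A (Kelvin sign) shrinks from 3 bytes to 1 when lowercased, U+017F (long s) shrinks from 2 bytes to 1 when uppercased, and U+0130 (capital I with dot above) grows from 2 bytes to 3 when lowercased. Σ does not appear in this table, since all three sigma forms take 2 bytes.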

Rendello 8 months ago

Σ now shows up on my Unicode round-trip horror show ;)

https://news.ycombinator.com/item?id=42020476