Comment by Wowfunhappy
6 months ago
So, uh, is this actually desirable per the Turkish language? Or is it more-or-less a bug?
I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.
My understanding is it's a bug that the case changes don't round trip correctly, in part due to questionable Unicode design that made the upper and lower case operations language dependent.
This Stack Overflow question has more details, but apparently the Turkish i and I are not their own Unicode code points, which is why this ends up gnarly.
https://stackoverflow.com/questions/48067545/why-does-unicod...
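For anyone who wants to see the four characters involved, here's a quick Python sketch (my own illustration, not taken from the linked answer). It prints the code point and Unicode name of each one; `describe` is just a helper name I made up.

```python
import unicodedata

def describe(ch):
    # Format a single character as "U+XXXX UNICODE NAME".
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# "i" and "I" are the ordinary ASCII letters shared with English;
# only the two "odd ones out" get code points of their own.
for ch in ["i", "I", "\u0130", "\u0131"]:
    print(ch, describe(ch))
```

The plain "i" and "I" are the same code points English uses, so the text alone can't tell you which casing rules apply.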
Ah, I see the problem now!
In Turkish:
• Lowercase dotted I ("i") maps to uppercase dotted I ("İ")
• Lowercase dotless I ("ı") maps to uppercase dotless I ("I")
In English, uppercase dotless I ("I") maps to lowercase dotted I ("i"), because those are the only kinds we have.
Ew! So it's a conflict of language behavior. There's no "correct" way to handle this unless you know which language is currently in use!
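To make the conflict concrete, here's a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` use the default Unicode mappings and don't take a locale, so the `lang` parameter and the `upper`/`lower` wrappers below are hypothetical helpers I wrote for illustration, covering only the Turkish i pairs.

```python
DOTTED_UPPER = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_LOWER = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I

# Hypothetical Turkish overrides, applied before the default mapping.
_TR_UPPER = str.maketrans({"i": DOTTED_UPPER, DOTLESS_LOWER: "I"})
_TR_LOWER = str.maketrans({DOTTED_UPPER: "i", "I": DOTLESS_LOWER})

def upper(text, lang="en"):
    # Uppercase text, using Turkish i rules when lang == "tr".
    if lang == "tr":
        text = text.translate(_TR_UPPER)
    return text.upper()

def lower(text, lang="en"):
    # Lowercase text, using Turkish i rules when lang == "tr".
    if lang == "tr":
        text = text.translate(_TR_LOWER)
    return text.lower()

# Default rules: dotless ı uppercases to I, which lowercases to dotted i.
print(lower(upper(DOTLESS_LOWER)))
# Turkish rules: the same round trip gives back dotless ı.
print(lower(upper(DOTLESS_LOWER, "tr"), "tr"))
```

Same input, different answer depending on `lang`, which is the whole problem: the caller has to supply the language because the string doesn't carry it.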
Even if you were to start over, I'm not convinced that using different Unicode code points would have been the right solution, since the rest of the alphabet is the same.
Me and someone else in the thread ran into the same string searching bug with that same character:
https://news.ycombinator.com/item?id=42016936
yup. lowercase and uppercase operations depend on language. It's rough.
In some APIs this distinction shows through, e.g. JavaScript's Intl.Collator is a language-aware sorting interface.
In practice, the best bet is usually to avoid casing conversions entirely and let users handle uppercase vs lowercase on their own. But if you have to do case-insensitive operations, there are lots more questions about which normalization you should use, and if you want to match user intuition you are going to want to take the language of the text into consideration.
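As a rough illustration of what that looks like, here's a Python sketch of a comparison key that normalizes first and then case-folds. `fold_key` and its `lang` parameter are names I made up; the only Turkish handling is the i special case, and picking NFC over another normalization form is an assumption, not a recommendation.

```python
import unicodedata

DOTTED_UPPER = "\u0130"   # İ
DOTLESS_LOWER = "\u0131"  # ı

def fold_key(text, lang="en"):
    # Compose first, so "I" + combining dot above becomes a single İ.
    text = unicodedata.normalize("NFC", text)
    if lang == "tr":
        # Turkish: İ folds to i, and plain I folds to dotless ı.
        text = text.replace(DOTTED_UPPER, "i").replace("I", DOTLESS_LOWER)
    # Default Unicode case folding for everything else.
    return unicodedata.normalize("NFC", text.casefold())

def same_ignoring_case(a, b, lang="en"):
    return fold_key(a, lang) == fold_key(b, lang)

print(same_ignoring_case("ISTANBUL", "istanbul"))              # default rules
print(same_ignoring_case("ISTANBUL", "istanbul", "tr"))        # Turkish rules
print(same_ignoring_case("\u0130STANBUL", "istanbul", "tr"))   # İSTANBUL, Turkish rules
```

Under the default rules "ISTANBUL" and "istanbul" match; under the Turkish rules they don't, and it's "İSTANBUL" that matches instead. A search box that guesses the wrong language will silently miss results either way.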
Yeah, making a specific "Turkish lowercase dotted i" character which looks and behaves exactly like the regular i except for uppercasing feels like introducing even more unexpected situations (and also invites the next homograph attack)
I guess it's a general situation: If you have some data structure which works correctly for 99.99% of all cases, but there is one edge case that cannot be represented correctly, do you really want to throw out the whole data structure?