Comment by LegionMammal978

6 months ago

A standard example here is the Turkish dotless I, which yields "ı" → "I" → "i" with most case-conversion libraries.
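
To see the chain concretely, here is a minimal Python sketch. It assumes Python 3's built-in `str.upper()` and `str.lower()`, which apply Unicode's default, locale-independent mappings; the helper name `round_trip` is made up for illustration.

```python
def round_trip(text: str) -> str:
    """Uppercase, then lowercase, using the default (locale-independent) mappings."""
    return text.upper().lower()


dotless = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I

print(dotless.upper())                 # I
print(round_trip(dotless))             # i
print(round_trip(dotless) == dotless)  # False: the round trip loses the dotless i
```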

It feels like unifying it with the ASCII i is the mistake here. There should have just been four Turkish characters in two pairs, rather than trying to reuse I/i.

It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.
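
A quick way to check the look-alike claim, as a sketch using only Python's standard `unicodedata` module (the variable name `lookalikes` is just for illustration): the three capitals are separate code points, and each lowercases within its own script.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: visually identical, separately encoded.
lookalikes = ["A", "\u0391", "\u0410"]

for ch in lookalikes:
    low = ch.lower()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(low):04X} {unicodedata.name(low)}")
```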

  • But even with separate characters you aren't safe, because the ASCII "unification" isn't just Unicode's fault to begin with; in some cases it is historical and cultural in its own right. German ß has distinct upper- and lower-case forms, but it also has a complicated history: depending on locale, the upper-case form is sometimes "SS" rather than ẞ. In many of those same locales the lower-case form of "SS" is "ss", not ß. It doesn't even try to round-trip, and that's sort of intentional/cultural.

    • Uppercase ẞ has only been officially accepted in German orthography since 2017 (Unicode encoded it in 2008), so before that using SS as a replacement was the correct way of doing things. That is relatively recent when it comes to that kind of change.
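
The ß behavior described in this sub-thread can be reproduced with Python 3's default case mappings. This is only a sketch of what the built-in string methods do, not a statement about what any particular German locale should do; the variable names are illustrative.

```python
lower_sharp_s = "\u00df"  # ß LATIN SMALL LETTER SHARP S
upper_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S

print(lower_sharp_s.upper())          # SS  (the default uppercase is two letters, not ẞ)
print(lower_sharp_s.upper().lower())  # ss  (so the round trip does not return ß)
print(upper_sharp_s.lower())          # ß   (the capital form does lowercase to ß)
print(lower_sharp_s.casefold())       # ss  (case folding, meant for caseless comparison)
```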

  • This stems from the earlier Turkish 8-bit character sets like IBM code page 857, which Unicode was designed to be roundtrip-compatible with.

    Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between the two, so they would get mixed up in practice anyway.

    There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.

  • When do you think that first mistake happened?

    (Pick a year, then think about why it didn't happen in that year.)

    • When Unicode was being specced out originally, I guess. There was more interest in unifying characters at that stage (see also the far more controversial Han unification).

So, uh, is this actually desirable per the Turkish language? Or is it more-or-less a bug?

I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.

  • My understanding is it's a bug that the case changes don't round trip correctly, in part due to questionable Unicode design that made the upper and lower case operations language dependent.

    This Stack Overflow question has more details, but apparently Turkish i and I do not have their own Unicode code points, which is why this ends up gnarly.

    https://stackoverflow.com/questions/48067545/why-does-unicod...

    • Ah, I see the problem now!

      In Turkish:

      • Lowercase dotted I ("i") maps to uppercase dotted I ("İ")

      • Lowercase dotless I ("ı") maps to uppercase dotless I ("I")

      In English, uppercase dotless I ("I") maps to lowercase dotted I ("i"), because those are the only kinds we have.

      Ew! So it's a conflict of language behavior. There's no "correct" way to handle this unless you know which language is currently in use!

      Even if you were to start over, I'm not convinced that using different Unicode code points would have been the right solution, since the rest of the alphabet is the same.
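
A rough sketch of what "knowing which language is in use" could look like in Python. Everything here is an assumption for illustration: the function names `upper` and `lower`, the `lang` parameter, and the `TURKIC` set are invented, and the code only special-cases the precomposed i/İ/ı/I pairs (it ignores decomposed forms such as I followed by a combining dot). A real system would more likely lean on a locale-aware library such as ICU.

```python
# Hypothetical language tags that use the dotted/dotless i pairs.
TURKIC = {"tr", "az"}


def upper(text: str, lang: str) -> str:
    """Uppercase text, pairing i with İ and ı with I for Turkic languages."""
    if lang in TURKIC:
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


def lower(text: str, lang: str) -> str:
    """Lowercase text, pairing İ with i and I with ı for Turkic languages."""
    if lang in TURKIC:
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(upper("istanbul", "tr"))         # İSTANBUL
print(lower("ISI", "tr"))              # ısı
print(lower(upper("ısı", "tr"), "tr")) # ısı  (round-trips once the language is known)
print(upper("istanbul", "en"))         # ISTANBUL
```

With the language passed in explicitly, each pair maps 1-to-1 again; without it, the same input has two defensible answers.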
