
Comment by Macha

6 months ago

It feels like unifying it with the ASCII i is the mistake here. There should have just been 4 Turkish characters in 2 pairs, rather than trying to reuse I/i.

It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.
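To make the comparison concrete, here is a small Python sketch (standard library only; the variable names are just for illustration) showing that the three look-alike capitals are separate code points with separate case mappings:

```python
import unicodedata

# Greek capital alpha, Latin capital A, Cyrillic capital A:
# visually near-identical, but three distinct code points.
lookalikes = ["\u0391", "A", "\u0410"]

names = [unicodedata.name(c) for c in lookalikes]
lowered = [c.lower() for c in lookalikes]

for char, name, low in zip(lookalikes, names, lowered):
    print(f"U+{ord(char):04X} {name} -> U+{ord(low):04X}")
```

Each script keeps its own lowercase partner, which is the arrangement the comment wishes Turkish I/i had been given.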

But even with separate characters, you aren't safe, because the ASCII "unification" isn't just Unicode's fault to begin with; in some cases it is historic/cultural in its own ways. German ß has distinct upper- and lower-case forms, but also a complicated history: depending on locale, the upper-case form is sometimes "SS" rather than the upper-case ẞ. In many of those same locales the lower-case form of "SS" is "ss", not ß. It doesn't even try to round-trip, and that's sort of intentional/cultural.

  • Uppercase ẞ has only been officially part of German orthography since 2017, so before that using SS as a replacement was the correct way of doing things. That is relatively recent when it comes to that kind of change.
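For what it's worth, the non-round-tripping behaviour is easy to see with Python's built-in, locale-independent case mappings. This is only a sketch of what the default Unicode mappings do, not a statement about what any particular German locale ought to do:

```python
lower_sharp_s = "\u00df"   # ß
upper_sharp_s = "\u1e9e"   # ẞ

word = "stra" + lower_sharp_s + "e"   # "straße"

shouted = word.upper()     # default mapping expands ß to "SS"
back = shouted.lower()     # "SS" lowercases to "ss", not ß

print(word, "->", shouted, "->", back)

# The capital ẞ does lowercase to ß, but ß does not uppercase to ẞ.
print(upper_sharp_s.lower() == lower_sharp_s)
print(lower_sharp_s.upper() == upper_sharp_s)
```

So `upper()` followed by `lower()` turns "straße" into "strasse", and the default mappings treat ẞ as a one-way partner of ß.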

This stems from the earlier Turkish 8-bit character sets like IBM code page 857, which Unicode was designed to be roundtrip-compatible with.

Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between both, so they would get mixed up in practice anyway.

There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.
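A rough illustration in Python, whose built-in `str.upper()` and `str.lower()` are not language-aware. The helpers `turkish_upper` and `turkish_lower` are hypothetical names made up for this sketch; they only handle precomposed I/ı/İ/i and ignore decomposed input (I followed by a combining dot), so treat them as a demonstration rather than a real implementation:

```python
def turkish_lower(text: str) -> str:
    # Hypothetical helper: apply the Turkish pairs İ->i and I->ı first,
    # then let the default mapping handle everything else.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(text: str) -> str:
    # Hypothetical helper: apply the Turkish pairs i->İ and ı->I first.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


city = "KADIK\u00d6Y"                  # KADIKÖY, written with dotless I

default_lower = city.lower()           # language-blind result
tr_lower = turkish_lower(city)         # result when we know it is Turkish

print(default_lower, tr_lower)
print("istanbul".upper(), turkish_upper("istanbul"))
```

The same input gives two different answers, and nothing in the string itself says which one is right. That information has to come from outside the text.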

When do you think that first mistake happened?

(Pick a year, then think about why it didn't happen in that year.)

  • When Unicode was being specced out originally I guess. There was more interest in unifying characters at that stage (see also the far more controversial Han unification)

    • Uh-huh. At that time roundtrip compatibility with all widely used 8-bit encodings was a major design criterion. Roundtrip meaning that you could take an input string in e.g. ISO 8859-9, convert it to Unicode, convert it back, and get the same string, still usable for purposes like database lookups. Would you have argued to break database lookups at the time?
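The roundtrip property described here can be sketched with Python's `iso8859_9` codec. The four sample bytes are my own choice for illustration: the two ASCII letters plus the two Turkish-specific ones.

```python
# I, i, İ, ı as they appear in ISO 8859-9.
legacy = bytes([0x49, 0x69, 0xDD, 0xFD])

text = legacy.decode("iso8859_9")          # legacy bytes -> Unicode
code_points = [hex(ord(c)) for c in text]
roundtrip = text.encode("iso8859_9")       # Unicode -> legacy bytes

print(code_points)
print(roundtrip == legacy)
```

The ASCII bytes 0x49 and 0x69 stay U+0049 and U+0069, and only İ and ı get new code points. Because 8-bit text used a single byte for i whether the word was Turkish or not, a converter had no way to decide when that byte should become a separate "Turkish i".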
