Comment by Macha

6 months ago

When Unicode was being specced out originally, I guess. There was more interest in unifying characters at that stage (see also the far more controversial Han unification).

Uh-huh. At that time round-trip compatibility with all widely used 8-bit encodings was a major design criterion. Round-trip meaning that you could take an input string in e.g. ISO 8859-9, convert it to Unicode, convert it back, and get the same string, still usable for purposes like database lookups. Would you have argued to break database lookups at the time?

  • ISO-8859-9 actually does have what I suggest:

    FD/49 are lower/upper dotless ı/I

    DD/69 are upper/lower dotted İ/i.

    Nothing about the ability to round-trip that through Unicode required 49 in ISO-8859-9 to be assigned the same Unicode code point as 49 in ISO-8859-1 just because they happen to be visually identical.

    • There is a reason: ISO-8859-9 is an extended ASCII character set. The shared characters are not an accident, they are by definition the same characters. Most ISO character sets follow a specific template with fixed ranges for shared and custom characters. Interpreting that i as anything special would violate the spec.

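For anyone who wants to check the byte values quoted in this thread, here is a minimal sketch. It assumes Python's built-in `iso8859_9` and `latin_1` codecs, and the variable names are illustrative only. It decodes the four bytes FD/49/DD/69, lists the Unicode code points they map to, and confirms that the round trip back to ISO-8859-9 is lossless.

```python
# The four bytes discussed above, read as ISO-8859-9:
# FD = dotless i, 49 = I, DD = dotted capital I, 69 = i.
raw = bytes([0xFD, 0x49, 0xDD, 0x69])

# Decode to Unicode and list the code point assigned to each byte.
text = raw.decode("iso8859_9")
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Encode back to ISO-8859-9 to check the round trip.
round_tripped = text.encode("iso8859_9")

# Byte 49 maps to the same code point whether the input is
# treated as ISO-8859-9 or as ISO-8859-1.
shared_i = bytes([0x49]).decode("iso8859_9") == bytes([0x49]).decode("latin_1")

print(code_points)
print(round_tripped == raw, shared_i)
```

Only FD and DD land outside the ASCII range (U+0131 and U+0130). Bytes 49 and 69 keep their ASCII code points, which is the unification both comments are arguing about.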