Comment by Arnt

6 months ago

Uh-huh. At that time, roundtrip compatibility with all widely used 8-bit encodings was a major design criterion. Roundtrip meaning that you could take an input string in e.g. ISO 8859-9, convert it to unicode, convert it back, and get the same string, still usable for purposes like database lookups. Would you have argued to break database lookups at the time?

ISO-8859-9 actually does have what I suggest:

FD/49 are lower/upper dotless ı/I.

DD/69 are upper/lower dotted İ/i.

There's nothing about the ability to round-trip that through unicode that required 49 in ISO-8859-9 to be assigned the same unicode codepoint as 49 in ISO-8859-1 just because the two characters happen to be visually identical.
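
For what it's worth, the mapping is easy to check with Python's built-in codecs. A quick sketch, nothing beyond the standard library:

    # Decode the four i-variants from ISO-8859-9, show which unicode
    # codepoints they land on, and round-trip back to the original bytes.
    raw = bytes([0x49, 0xFD, 0xDD, 0x69])     # I, ı, İ, i in ISO-8859-9

    text = raw.decode("iso-8859-9")
    for byte, ch in zip(raw, text):
        print(f"0x{byte:02X} -> U+{ord(ch):04X} {ch}")
    # 0x49 -> U+0049 I   (plain ASCII capital I)
    # 0xFD -> U+0131 ı   (dotless small i)
    # 0xDD -> U+0130 İ   (dotted capital I)
    # 0x69 -> U+0069 i   (plain ASCII small i)

    assert text.encode("iso-8859-9") == raw   # the round trip is lossless

The round trip is lossless either way; the question is whether 49 had to land on U+0049 specifically.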

  • There is a reason: ISO-8859-9 is an extended ASCII character set. The shared characters are not an accident; they are by definition the same characters. Most ISO character sets follow a common template: the lower half is plain ASCII, and only the upper half varies. Interpreting that i as anything special would violate the spec.

    • In practical terms:

      Back in those days, people would store a mixture of ASCII and other data in the same database, e.g. ASCII in some rows, ISO-8859-9 in others. (My bank at the time did that: some customers had all-ASCII names, some had names with ø and so on.) If unicode had been only mostly compatible with that mixture, it wouldn't have been safe to start migrating software that accessed databases/servers/… For example, using UTF-8 for display and the database's native encoding to access the DBMS would have had difficult-to-understand limitations (see the sketch below).

      You can fix all kinds of bugs if you're able to disregard compatibility with old data or old systems. But you can't. And that's why unicode is constrained by e.g. the combination of a decision made in Sweden hundreds of years ago with one made in Germany around the same time. Compatibility with both leads to nontrivial choices and complexity; incompatibility leads to the software scrap heap.
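
      A minimal sketch of that compatibility point, again with Python's standard codecs: an all-ASCII lookup key decodes to the same unicode string no matter which of the old encodings (or UTF-8) a given row was written in, which is what made gradual migration safe.

          key = b"SMITH"                      # all-ASCII lookup key

          # Identical under Latin-1, Latin-5 (Turkish) and UTF-8, because
          # all three share the ASCII range byte-for-byte.
          decoded = {enc: key.decode(enc)
                     for enc in ("iso-8859-1", "iso-8859-9", "utf-8")}
          assert len(set(decoded.values())) == 1

          # Had ISO-8859-9 mapped 49 to U+0131 instead of U+0049, a key
          # read from an ISO-8859-9 row would no longer compare equal to
          # the same key read from an ASCII/Latin-1 row, and lookups
          # would silently fail.
          assert b"I".decode("iso-8859-9") == b"I".decode("iso-8859-1") == "I"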