Comment by oneshtein

16 hours ago

Yep, but you decided to abuse the Latin alphabet instead of creating your own code page with your own letters and your own rules.

We created our own letters and our own rules. In 1928, long before code pages and computers.

The assumption that letters come in universal pairs is wrong. That assumption is the bug. You can’t assume that capitalization rules are the same for every language that uses a given alphabet. Those rules can differ from language to language, and they do.
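To make that concrete, here is a rough Python sketch of language-aware casing. Python's built-in `str.upper()` and `str.lower()` take no language argument, so the Turkic pairs (i/İ and ı/I) are swapped by hand first. The names `TURKIC_LANGS`, `upper_for_lang` and `lower_for_lang` are made up for this example, and the language set is an assumption, not a complete list.

```python
# Sketch only: the helper names below are hypothetical, not a library API.
# Turkish and Azerbaijani pair i with İ (U+0130) and ı (U+0131) with I.
TURKIC_LANGS = {"tr", "az"}


def upper_for_lang(text: str, lang: str) -> str:
    """Uppercase text, applying the dotted/dotless i rule for Turkic languages."""
    if lang in TURKIC_LANGS:
        # i -> İ first, then ı -> I, so the two swaps never interfere.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


def lower_for_lang(text: str, lang: str) -> str:
    """Lowercase text, applying the dotted/dotless i rule for Turkic languages."""
    if lang in TURKIC_LANGS:
        # İ -> i first, then I -> ı, for the same reason.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


if __name__ == "__main__":
    print(upper_for_lang("istanbul", "en"))  # ISTANBUL
    print(upper_for_lang("istanbul", "tr"))  # İSTANBUL
    print(lower_for_lang("ISPARTA", "tr"))   # ısparta
```

Same input string, different correct answer depending on the language. The language has to be passed in from outside, because nothing in the text itself says which rule applies.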

And not just capitalization rules. Autocomplete, for instance, should respect the language as well. You can’t “correct” a French word to an English word. Localization is not optional when dealing with text.

  • Do all the letters have separate Unicode codepoints, or do some of them reuse the Latin ones?

    • There are the following codepoints:

          U+0049 I LATIN CAPITAL LETTER I
          U+0069 i LATIN SMALL LETTER I
          U+0130 İ LATIN CAPITAL LETTER I WITH DOT ABOVE
          U+0131 ı LATIN SMALL LETTER DOTLESS I
      

      The names of the first two don't explicitly say that U+0049 is dotless and U+0069 is dotted. However, the Unicode chart for the block containing those two [0] does contrast them with the dotted and dotless versions, which at least implies that they should be rendered that way.

      Unicode has historically been against adding a separate codepoint for every single language's orthography when the glyphs are (almost) identical to an existing one ("allographs"). Controversy arose when the consortium proposed considering Han characters, which do have language variants, to be allographs, which led to what is known as "Han unification".

      [0]: https://www.unicode.org/charts/PDF/U0000.pdf
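For anyone who wants to poke at the four codepoints listed above, here is a small Python sketch using the standard `unicodedata` module. It prints each character's name and what the language-independent `str.upper()` / `str.lower()` mappings produce. The helper names `describe` and `codepoints` are invented for this example.

```python
import unicodedata

# The four codepoints from the list above.
LETTERS = ["\u0049", "\u0069", "\u0130", "\u0131"]


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def codepoints(text: str) -> list[str]:
    """List the codepoints of a string as 'U+XXXX' labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    for ch in LETTERS:
        # Default mappings only: no language is taken into account here.
        print(describe(ch))
        print("  upper ->", codepoints(ch.upper()))
        print("  lower ->", codepoints(ch.lower()))
```

Under these default mappings, dotless ı uppercases to the plain Latin I (U+0049), and İ lowercases to the plain Latin i followed by a combining dot above (U+0069 U+0307). So the Turkish letters are tied to the shared Latin codepoints rather than living in a fully separate set.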