Comment by jeroenhd

1 day ago

Computers and localisation weren't relevant back in the early 20th century. The dotless existed before the dotted i (in Greek script as iota). Some European scholars putting an extra dot on the letter to make it stand out a bit more are as much to blame as the Turks for making the distinction between the different i-vowels clear.

Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.

6 comments

jeroenhd

JuniperMesos 21 hours ago

It's not exactly programmers failing to take into account that no everybody writes in English - if that were the case, then it would simply be impossible to represent the Turkish lowercase-dotless and uppercase-dotted I at all. The actual problem is failing to take into account that operations on text strings that work in one language's writing might not work the same way in a different language's writing system. There's a lot of languages in the world that use the Latin writing system, and even if you are personally a fluent speaker and writer of several of them, you might simply have not learned about Turkish's specific behavior with I.

jagrsw 1 day ago

> that not everybody writes in English.

I don't know... I understand the history and reasons for this capitalization behavior in Turkish, and my native language isn't English, which had to use a lot of strange encodings before the introduction of UTF-8.

But messing around with the capitalization of ASCII <= codepoint(127) is a risky business, in my opinion. These codepoints are explicitly named:

"LATIN CAPITAL LETTER I" "LATIN SMALL LETTER I"

and requiring them to not match exactly during capitalization/diminuitization sounds very risky.

troad 18 hours ago

> Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.

This bug is the exact opposite of that. The program would have worked fine had it used pure ASCII transforms (±0x20); it was the use of library functions that did in fact take Turkish into account that caused the problem.

More broadly, this is not an easy issue to solve. If a Turkish programmer writes code, what is the expected behaviour for metaprogramming and compilers? Are the function names in English or Turkish? What about variables, object members, struct fields? You could have one variable name that references some government ID number using its native Turkish name, right next to another variable name that uses the English "ID". How does the compiler know what locale to use for which symbol?

Boiling all of this down to 'just be more considerate' is not actually constructive or actionable.

jeroenhd 8 hours ago
The issue is actually quite easy to solve by specifying a default locale for string operations when you are not dealing with user input. Whether you pick US or ROOT or Turkish as a default locale, all you need to do is make sure that your fancy metaprogramming tricks relying on strings-as-enums are all parsed the same way. Locale.ROOT for Java, InvariantCulture or ToUpperInvariant() for C#, you name it.
The whole problem is that the compiler has no idea about the locale of any strings in the system, that's why it's on the programmer to specify them.
Lowercasing/uppercasing a string takes an (infuriatingly) optional locale parameter, and the moment that gets involved, you should think twice before using it for anything other than user data processing. I would happily see Oracle deprecate all string operations lacking a locale in the next version of Java.
- troad 8 hours ago
  
  > actually quite easy to solve
  I cannot square your earlier assertion that we should be more mindful "that not everybody writes in English", with your current assertion that all code must only ever contain English, for simplicity's sake. Either is a cogent position on its own, just not both at the same time.
  This bug arose because the programmers made incorrect assumptions about the result of a case-changing operation. If you impose English case rules on Turkish symbol names, this exact bug would simply arise in reverse.
  More problematically, as I alluded to earlier, Turkish code may contain a mix of languages. It may, for example, be using a DSL to talk to a database with fields named in Turkish, as well as making calls to standard library functions named in English. Which half of the code is your proposed invariant locale going to break?