Comment by jeroenhd

4 months ago

The issue is actually quite easy to solve by specifying a default locale for string operations when you are not dealing with user input. Whether you pick US or ROOT or Turkish as a default locale, all you need to do is make sure that your fancy metaprogramming tricks relying on strings-as-enums are all parsed the same way. Locale.ROOT for Java, InvariantCulture or ToUpperInvariant() for C#, you name it.
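To make that concrete, here is a minimal Java sketch of the idea. The class and method names (`InvariantCase`, `toEnumKey`) are made up for illustration; only `Locale.ROOT`, `Locale.forLanguageTag`, and `String.toUpperCase(Locale)` come from the standard library.

```java
import java.util.Locale;

// Illustrative sketch; the class and method names are hypothetical.
class InvariantCase {
    // Normalizes an internal identifier (not user-facing text) with a fixed locale.
    static String toEnumKey(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Turkish rules map 'i' to dotted capital 'İ' (U+0130), so this is not "TITLE".
        System.out.println("title".toUpperCase(turkish).equals("TITLE")); // false

        // Locale.ROOT gives the same answer regardless of the machine's default locale.
        System.out.println(toEnumKey("title").equals("TITLE")); // true
    }
}
```

The point of routing everything through one helper is that the locale choice is made once, rather than at every call site.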

The whole problem is that the compiler has no idea about the locale of any strings in the system, which is why it's on the programmer to specify them.

Lowercasing/uppercasing a string takes an (infuriatingly) optional locale parameter, and the moment that gets involved, you should think twice before using it for anything other than user data processing. I would happily see Oracle deprecate all string operations lacking a locale in the next version of Java.
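As a rough illustration of why the optional parameter bites: in Java, the no-argument `toUpperCase()` falls back to `Locale.getDefault()`, so the same line of code can return different strings on different machines. The helper below (`upperUnderDefault`, a hypothetical name) swaps the default locale temporarily just to demonstrate this. `Locale.setDefault` is process-wide, so this is a demo, not something to do in production code.

```java
import java.util.Locale;

// Illustrative sketch; the class and method names are hypothetical.
class DefaultLocalePitfall {
    // Runs the no-argument toUpperCase() under a temporary default locale,
    // then restores the previous default.
    static String upperUnderDefault(String s, Locale temporaryDefault) {
        Locale saved = Locale.getDefault();
        try {
            Locale.setDefault(temporaryDefault);
            return s.toUpperCase(); // equivalent to s.toUpperCase(Locale.getDefault())
        } finally {
            Locale.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        String inEnglish = upperUnderDefault("quit", Locale.US);
        String inTurkish = upperUnderDefault("quit", Locale.forLanguageTag("tr-TR"));

        // Same source line, different result depending on the environment.
        System.out.println(inEnglish.equals(inTurkish)); // false
    }
}
```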

troad 4 months ago

> actually quite easy to solve

I cannot square your earlier assertion that we should be more mindful "that not everybody writes in English", with your current assertion that all code must only ever contain English, for simplicity's sake. Either is a cogent position on its own, just not both at the same time.

This bug arose because the programmers made incorrect assumptions about the result of a case-changing operation. If you imposed English case rules on Turkish symbol names, this exact bug would simply arise in reverse.

More problematically, as I alluded to earlier, Turkish code may contain a mix of languages. It may, for example, be using a DSL to talk to a database with fields named in Turkish, as well as making calls to standard library functions named in English. Which half of the code is your proposed invariant locale going to break?
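A small Java sketch of the mixed-language case, under some assumptions: the field name "sınıf" and the class and method names are invented for illustration, and a case round trip stands in for whatever symbol mangling the real code does.

```java
import java.util.Locale;

// Illustrative sketch; the names and the sample field are hypothetical.
class MixedLanguageCase {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Upper-cases and then lower-cases a name under a single locale.
    static String roundTrip(String name, Locale locale) {
        return name.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        String turkishField = "s\u0131n\u0131f"; // "sınıf", spelled with dotless i (U+0131)
        String englishCall = "title";

        // Invariant rules: the English name survives, the Turkish one comes back as "sinif".
        System.out.println(roundTrip(englishCall, Locale.ROOT).equals(englishCall));   // true
        System.out.println(roundTrip(turkishField, Locale.ROOT).equals(turkishField)); // false

        // Turkish rules: the Turkish name survives the round trip,
        // but the English name no longer upper-cases to ASCII "TITLE".
        System.out.println(roundTrip(turkishField, TURKISH).equals(turkishField));     // true
        System.out.println(englishCall.toUpperCase(TURKISH).equals("TITLE"));          // false
    }
}
```

Neither locale handles both names correctly, which is the dilemma described above.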