Comment by jchw

4 years ago

It is a good opportunity to mention that you should not use strtod/strtol either if you can help it, since they are impacted by locale. Exactly what to use instead is a bit of a tough nut to crack; you could extract musl’s floatscan code, or implement the Clinger algorithm yourself. Or, of course, use programming languages that have a more reasonable option in the standard library...



kazinator 4 years ago

I see you are reiterating this point raised in the previous discussion several days ago, but I don't think it is particularly well grounded.

ISO C allows strtod and strtol to accept, other than in the "C" locale, additional "subject sequence forms".

This does not affect programming language implementations which extract specific token patterns from an input stream, which either are, or are transformed into the portable forms supported by these functions.

What the requirement means is that the functions cannot be relied on to reject inputs that are outside of their description. Those inputs could accidentally match some locale-dependent representations.

You must do your own rejecting.

So for instance, if an integer token is a sequence of ASCII digits with an optional + or - sign, ensured by your lexical analyzer's regex, you can process that with strtol without worrying about locale-dependent behavior.

Basically, rely on the functions only for conversion, and feed them only the portable inputs.
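
To make the "do your own rejecting" idea concrete, here is a minimal sketch in C. The helper name parse_ascii_long is hypothetical (it is not from the thread or from any library); it only illustrates checking the token shape yourself and then using strtol purely for conversion and range checking.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: accept only [+-]?[0-9]+ and nothing else,
   then let strtol do the conversion. The digit test compares against
   '0'..'9' directly, so no locale-sensitive classification is involved. */
static bool parse_ascii_long(const char *s, long *out)
{
    const char *p = s;

    if (*p == '+' || *p == '-')
        p++;
    if (*p < '0' || *p > '9')
        return false;              /* need at least one digit */
    while (*p >= '0' && *p <= '9')
        p++;
    if (*p != '\0')
        return false;              /* trailing junk: we reject it ourselves */

    errno = 0;
    char *end;
    long v = strtol(s, &end, 10);
    if (errno == ERANGE || *end != '\0')
        return false;              /* out of range for long */
    *out = v;
    return true;
}
```

With this shape, strtol never sees leading whitespace, a locale-specific form, or trailing characters, so its only remaining jobs are the arithmetic and the overflow check.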

jchw 4 years ago
I can't really understand what you mean. You can verify for yourself that parsing with strtod will break if your system's locale is set to one where the decimal separator is a comma ',' instead of a period '.' - as an example, most European locales. Whether or not strtod will try to magically fall back to "C" locale behavior is irrelevant because it is ambiguous. For example, what do you do if you are in Germany and you try to parse 100.001? Is it 100001?
strtod also doesn't guarantee the round-trip accuracy that you can achieve when you use Steele & White and Clinger. All in all, I really think it is just not a good idea to use the C standard library for string operations.
- kazinator 4 years ago
  
  Sorry about that; I see the text now in the description of strtod where it is required that the datum is interpreted like a floating-point constant in C, except that instead of the period character, the locale's decimal-point character is recognized.
  Yikes!
  There is a way to fix that, other than popping into the "C" locale, which is to look up that decimal character in the current locale, and substitute it into the string that is being fed to strtod. That's a fairly unsavory piece of code to have to write.
  What locale issues did you find for strto{l,ll,ul,ull}?
  > I really think it is just not a good idea to use the C standard library for string operations.
  I really despise locale-dependent behavior also, and try to avoid it.
  If you control the entire program, you can be sure nothing calls setlocale, but if you are writing middleware code, you can't assume anything.
  Also in POSIX environments, the utilities react to localization environment variables; they do call setlocale. And so much stuff depends on it, like for instance operators in regex! [A-Z] does not refer to 26 letters; what [A-Z] denotes in a POSIX regex depends on the collation order of the character set.
  There are functions in the C library without locale-specific behaviors, though, like strchr, strpbrk, strcpy, and their wide character counterparts.
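
The workaround described above (looking up the current locale's decimal character and substituting it into the string passed to strtod) could look roughly like the following sketch. The helper name strtod_portable and the fixed 128-byte buffer are assumptions made for illustration; the input is assumed to have already been validated as a portable literal that uses '.' as the radix character.

```c
#include <locale.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: s is a pre-validated literal using '.'.
   The '.' is replaced by whatever localeconv() reports as the
   decimal point, so strtod sees the form it expects in this locale. */
static bool strtod_portable(const char *s, double *out)
{
    const char *dp = localeconv()->decimal_point;   /* "." or e.g. "," */
    size_t dplen = strlen(dp);
    size_t len = strlen(s);
    const char *dot = strchr(s, '.');
    char buf[128];                                   /* assumed upper bound */

    if (len + dplen >= sizeof buf)
        return false;                                /* token too long */

    if (dot == NULL) {
        memcpy(buf, s, len + 1);                     /* no radix: copy as-is */
    } else {
        size_t head = (size_t)(dot - s);
        memcpy(buf, s, head);                        /* digits before '.' */
        memcpy(buf + head, dp, dplen);               /* locale's radix */
        memcpy(buf + head + dplen, dot + 1, len - head); /* tail plus NUL */
    }

    char *end;
    double v = strtod(buf, &end);
    if (end == buf || *end != '\0')
        return false;                                /* not fully consumed */
    *out = v;
    return true;
}
```

This is still racy if another thread changes the locale between the localeconv call and the strtod call, which is part of why it is unsavory; it also does nothing about the round-trip accuracy concern raised earlier in the thread.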
  