Comment by jchw

4 years ago

I can't really understand what you mean. You can validate yourself that parsing with strtod will break if your system's locale is set to a locale where the decimal separator is a comma ',' instead of a period '.' - as an example, most European locales. Whether or not strtod will try to magically fall back to "C" locale behavior is irrelevant because it is ambiguous. For example, what do you do if you are in Germany and you try to parse 100.001? Is it 100001?
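One way to check this on your own machine is sketched below. The helper name parse_in_locale is made up for illustration, and which locale names are installed (for example "de_DE.UTF-8") is an assumption that varies by system. It also calls setlocale, so it is not safe in threaded code.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Parse `text` with strtod while LC_NUMERIC is set to `locale_name`.
 * Returns 1 and fills *value and *consumed (bytes strtod accepted), or
 * returns 0 if the locale is not available on this system.
 * Hypothetical helper; setlocale changes process-wide state. */
static int parse_in_locale(const char *locale_name, const char *text,
                           double *value, size_t *consumed)
{
    assert(locale_name != NULL && text != NULL);
    assert(value != NULL && consumed != NULL);

    /* Copy the current setting: the returned string may be overwritten
     * by the next setlocale call. */
    const char *current = setlocale(LC_NUMERIC, NULL);
    char saved[128];
    if (current == NULL || strlen(current) >= sizeof saved)
        return 0;
    strcpy(saved, current);

    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return 0; /* locale not installed */

    char *end;
    *value = strtod(text, &end);
    *consumed = (size_t)(end - text);

    setlocale(LC_NUMERIC, saved); /* restore the previous setting */
    return 1;
}
```

In the "C" locale all seven characters of "100.001" are consumed. With a comma-decimal locale, if one is installed, I would expect strtod to stop at the period and consume only "100", but try it rather than take my word for it.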

strtod also doesn't guarantee round-trip accuracy that you can achieve when you use Steele & White and Clinger. All in all, I really think it is just not a good idea to use the C standard library for string operations.

kazinator 4 years ago

Sorry about that; I see the text now in the description of strtod where it is required that the datum is interpreted like a floating-point constant in C, except that instead of the period character, the locale's decimal-point character is recognized.

Yikes!

There is a way to fix that, other than popping into the "C" locale, which is to look up that decimal character in the current locale, and substitute it into the string that is being fed to strtod. That's a fairly unsavory piece of code to have to write.
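A rough sketch of that substitution, to show how unsavory it gets. The name parse_dot_decimal is hypothetical, and the sketch assumes the whole string must be one number. It reads the separator through localeconv, so it is still exposed to another thread calling setlocale at the wrong moment.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Parse a number written with '.' as the decimal separator, whatever the
 * current LC_NUMERIC locale is. Returns 1 on success, 0 on failure.
 * Hypothetical helper, not part of any library. */
static int parse_dot_decimal(const char *text, double *out)
{
    assert(text != NULL && out != NULL);

    const char *point = localeconv()->decimal_point; /* "." or e.g. "," */
    size_t point_len = strlen(point);
    size_t text_len = strlen(text);

    /* If the locale separator is not '.', input that already contains it
     * (such as "1,5") must be rejected, or strtod would accept it. */
    if (strcmp(point, ".") != 0 && strstr(text, point) != NULL)
        return 0;

    /* Room for the text with one '.' swapped for the separator, plus NUL. */
    char *buf = malloc(text_len + point_len + 1);
    if (buf == NULL)
        return 0;

    const char *dot = strchr(text, '.');
    if (dot == NULL) {
        memcpy(buf, text, text_len + 1);
    } else {
        size_t head = (size_t)(dot - text);
        memcpy(buf, text, head);
        memcpy(buf + head, point, point_len);
        /* Copies the tail after the '.', including the terminating NUL. */
        memcpy(buf + head + point_len, dot + 1, text_len - head);
    }

    char *end;
    double value = strtod(buf, &end);
    int ok = (end != buf && *end == '\0'); /* whole string must be used */
    free(buf);
    if (ok)
        *out = value;
    return ok;
}
```

Only the first '.' is replaced, so something like "1.2.3" leaves unparsed input behind and is rejected.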

What locale issues did you find for strto{l,ul,ull}?

> I really think it is just not a good idea to use the C standard library for string operations.

I really despise locale-dependent behavior also, and try to avoid it.

If you control the entire program, you can be sure nothing calls setlocale, but if you are writing middleware code, you can't assume anything.

Also in POSIX environments, the utilities react to localization environment variables; they do call setlocale. And so much stuff depends on it, like for instance operators in regex! [A-Z] does not refer to 26 letters; what [A-Z] denotes in a POSIX regex depends on the collation order of the character set.

There are functions in the C library without locale-specific behaviors, though, like strchr, strpbrk, strcpy, and their wide character counterparts.

jchw 4 years ago

> There are functions in the C library without locale-specific behaviors, though, like strchr, strpbrk, strcpy, and their wide character counterparts.
Obviously these string functions are totally fine if you use them correctly, and their implementations should be fairly optimal. However, I do think they put a lot of onus on the programmer to be very careful.
For example, using strncpy to avoid buffer overflows is an obvious trap, since it doesn’t null-terminate the destination when the source fills the whole buffer. strlcpy exists to help deal with the null termination issue, but it still has the failure mode of truncating on overflow, which can obviously lead to security issues if one is not careful. strcpy_s exists in C11 (in the optional Annex K) and in the Microsoft CRT, though I believe Microsoft’s “secure” functions work differently from C11’s. These functions are a bit better because they fail explicitly on truncation and clobber the destination.
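To make the "fail explicitly instead of truncating" idea concrete, here is a minimal sketch. copy_or_fail is a made-up name, not strcpy_s or any standard function, and emptying the destination on failure is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, which has room for dstsize bytes including the NUL.
 * Returns 0 on success. If src does not fit, returns -1 and leaves dst as
 * an empty string rather than a silently truncated one.
 * Hypothetical helper for illustration. */
static int copy_or_fail(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL && dstsize > 0);

    size_t len = strlen(src);
    if (len >= dstsize) {
        dst[0] = '\0'; /* clobber the destination so misuse is visible */
        return -1;
    }
    memcpy(dst, src, len + 1); /* +1 copies the terminating NUL */
    return 0;
}
```

The caller still has to check the return value, of course, which is exactly the kind of onus I mean.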
OpenBSD arguably has one of the best security track records of all C projects and I still feel wary about their preferred mechanism for string copying and concatenation (strlcpy and strlcat). I feel strlcpy and strlcat are both prone to errors if the programmer is not careful to avoid security and correctness issues caused by truncation, and strlcat has efficiency issues in many non-trivial use cases.
While there are obvious cases where dynamically allocated strings of arbitrary length, such as those seen in C++, Rust, Go, etc. can lead to security issues, especially DoS issues, I still feel they are a good foundation to build on because they are less prone to correctness issues that can lead to more serious problems. Whether you are in C or not, you will always need to set limits on inputs to avoid DoS issues (even unintentional ones) so I feel less concerned about the problems that come with strings that grow dynamically.
One of the biggest complaints about prefix-sized strings/Pascal style strings is that you can’t point into the string to get a suffix of the original string. However, in modern programming languages this is alleviated by making not only dynamic strings a primitive, but also string slices. (Even modern C++, with its string_view class.) String slices are even more powerful, since they can specify any range in a string, not just suffixes.
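For what it’s worth, a slice can be approximated even in C as a pointer plus a length. This is only a sketch: str_slice and its helpers are hypothetical names, the slice does not own its bytes, and nothing stops the underlying buffer from being freed while a slice still points into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A non-owning view of len bytes starting at data; not NUL-terminated.
 * Hypothetical type for illustration. */
typedef struct {
    const char *data;
    size_t len;
} str_slice;

/* View over a whole NUL-terminated string. */
static str_slice slice_from_cstr(const char *s)
{
    assert(s != NULL);
    str_slice v = { s, strlen(s) };
    return v;
}

/* Sub-range [start, end) of v, with both bounds clamped to v.len. */
static str_slice slice_range(str_slice v, size_t start, size_t end)
{
    if (end > v.len)
        end = v.len;
    if (start > end)
        start = end;
    str_slice out = { v.data + start, end - start };
    return out;
}

/* Compare a slice with a NUL-terminated string for equality. */
static int slice_equals_cstr(str_slice v, const char *s)
{
    size_t n = strlen(s);
    return v.len == n && memcmp(v.data, s, n) == 0;
}
```

Taking a suffix is then just slice_range(v, start, v.len), and any other range works the same way without copying.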
So really locale and strtod are just little microcosms of why I am wary of C string handling. Clearly you can write code using C string functions that is efficient, secure and correct. However, I feel like there are plenty of pitfalls for all three that even experienced programmers have trouble avoiding sometimes.
I don’t actually know of a case where locale can break strtol, but it doesn’t matter too much, since anyone can write a decent strtol implementation (as long as they test the edge cases carefully...) strtod though, is not so easy, and I guess that means apps are best off avoiding locales other than C. In a library though, there’s not much you can do about it. At least not without causing thread safety issues :)
In other languages, aside from dynamic strings and string slices, locale-independent string functions are also typically the default: Rust’s f64::from_str, Go’s strconv.ParseFloat, C++’s std::from_chars and so forth. It’s not too surprising, since a lot of the decisions made in these languages came specifically from trying to improve on C pitfalls. I do wish C itself would also consider at least adding some locale-independent string functions for things like strtod in a future standard...
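As an illustration of the "write your own strtol" point, here is a minimal sketch of a decimal-only parser that never consults the locale. parse_long_ascii is a hypothetical name, and unlike strtol it rejects leading whitespace, other bases, and trailing characters. The edge cases are where the care goes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Parse an optionally signed base-10 integer occupying the whole string.
 * Only ASCII '0'..'9', '+' and '-' are examined; no locale is involved.
 * Returns 1 on success, 0 on empty input, stray characters or overflow.
 * Hypothetical helper for illustration. */
static int parse_long_ascii(const char *s, long *out)
{
    assert(s != NULL && out != NULL);

    int negative = 0;
    if (*s == '+' || *s == '-') {
        negative = (*s == '-');
        s++;
    }
    if (*s < '0' || *s > '9')
        return 0; /* need at least one digit */

    /* Accumulate as a negative value so LONG_MIN can be represented. */
    long acc = 0;
    for (; *s >= '0' && *s <= '9'; s++) {
        int digit = *s - '0';
        if (acc < (LONG_MIN + digit) / 10)
            return 0; /* acc * 10 - digit would overflow */
        acc = acc * 10 - digit;
    }
    if (*s != '\0')
        return 0; /* trailing characters */

    if (!negative) {
        if (acc < -LONG_MAX)
            return 0; /* magnitude does not fit as a positive long */
        acc = -acc;
    }
    *out = acc;
    return 1;
}
```

Nothing comparable fits in a comment for strtod, which is rather the point.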