No way to parse integers in C (2022)

5 hours ago (blog.habets.se)

37 comments

konmok

One of the great virtues of C is that this sort of thing is not part of the language ...

I wasn't in this class myself, but one prof at my alma mater started his "Programming 201" class with the simplest assignment: write a C program that accepts two integers from the user and prints their sum. It actually was the only assignment for the rest of the semester, since he had a test suite that would humiliate the students gently at first, but would ultimately pipe a billion nines into stdin as the first argument.

msie 8 minutes ago

Perfect is the enemy of good.

alexfoo 2 hours ago

I remember an old project that ran into something like this. I think we just used atoi() or similar and the error check was a string comparison between the original input and a sprintf() of the converted value.

Ugly (and not performant if in a hot path) but it works.
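A minimal sketch of the round-trip check described above, assuming strtol() in place of atoi() (atoi() has undefined behavior when the value is out of range). The helper name parse_long_roundtrip is hypothetical, not something from the original project.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the value in *out only if
   printing the parsed value reproduces the input byte for byte. */
static int parse_long_roundtrip(const char *s, long *out)
{
    char buf[32];   /* large enough for a 64-bit long plus sign and NUL */
    long v;

    errno = 0;
    v = strtol(s, NULL, 10);
    if (errno == ERANGE)            /* clamped to LONG_MIN or LONG_MAX */
        return 0;
    snprintf(buf, sizeof buf, "%ld", v);
    if (strcmp(buf, s) != 0)        /* "007", "+5", " 5", "12abc", "" */
        return 0;
    *out = v;
    return 1;
}
```

The comparison is strict by construction: only the canonical decimal spelling of a value is accepted, which may or may not be what a given caller wants.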

bsenftner 3 hours ago

One of the first homework assignments when I learned C back in '83 was after a long lecture on how the string functions are fundamentally broken, and the class introduction to writing C was fixing all of them.

psvv 1 hour ago

My memory growing up is that making your own C library was basically an inevitable rite of passage for any aspiring programmer.

ramon156 2 hours ago

Why not look at how other languages attack this? E.g., how does "42".parse() work in Rust?

Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537

interesting! It boils down to this

  pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
      use self::IntErrorKind::*;
      use self::ParseIntError as PIE;
      // guard: radix must be 2..=36
      if 2 > radix || radix > 36 {
          from_ascii_radix_panic(radix);
      }
      if src.is_empty() {
          return Err(PIE { kind: Empty });
      }
      // Strip leading '+' or '-', detect sign
      // (a bare '+' or '-' with nothing after it is an error)
      // accumulate digits, checking for overflow
      Ok(result)
  }
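For comparison, the same outline (empty check, sign handling, digit accumulation with an overflow check) can be sketched in C. This is not a translation of the Rust source: the names parse_err, digit_value and parse_u32_radix are hypothetical, the code assumes an ASCII character set, and it assumes an unsigned parser should accept a leading '+' but treat '-' as an invalid digit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes, loosely modelled on the outline above. */
enum parse_err { PARSE_OK, PARSE_EMPTY, PARSE_INVALID_DIGIT, PARSE_OVERFLOW };

/* Map one byte to its digit value, or -1 if it is not a digit in the
   given radix. Assumes ASCII, where letters are contiguous. */
static int digit_value(unsigned char c, unsigned radix)
{
    int d;
    if (c >= '0' && c <= '9')
        d = c - '0';
    else if (c >= 'a' && c <= 'z')
        d = c - 'a' + 10;
    else if (c >= 'A' && c <= 'Z')
        d = c - 'A' + 10;
    else
        return -1;
    return (unsigned)d < radix ? d : -1;
}

/* Parse exactly len bytes: no whitespace, no trailing data, optional
   leading '+'. The caller must pass a radix in 2..36. */
static enum parse_err parse_u32_radix(const char *src, size_t len,
                                      unsigned radix, uint32_t *out)
{
    uint32_t acc = 0;
    size_t i = 0;

    if (len == 0)
        return PARSE_EMPTY;
    if (src[0] == '+')
        i = 1;
    if (i == len)
        return PARSE_INVALID_DIGIT;   /* a bare '+' */
    for (; i < len; i++) {
        int d = digit_value((unsigned char)src[i], radix);
        if (d < 0)
            return PARSE_INVALID_DIGIT;
        /* Check before multiplying so acc never wraps. */
        if (acc > (UINT32_MAX - (uint32_t)d) / radix)
            return PARSE_OVERFLOW;
        acc = acc * radix + (uint32_t)d;
    }
    *out = acc;
    return PARSE_OK;
}
```

Returning an error code and writing the value through a pointer plays the role that Result plays in the Rust signature.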

marcosdumay 15 minutes ago

It's not an overwhelmingly hard problem. There are some issues with radix signaling, exponent notation, decimal points being allowed or not, and group separators that make parsing numbers incredibly irritating. So you usually don't want to do it yourself.
But it's not hard at all. It's not even as full of small issues that you can't handle the load, like dates. It's just annoying as hell.
The problem is exclusive to C and C++. It's created by the several rounds of standardization of broken behavior.

zokier 3 hours ago

I thought it was pretty well known that everything related to strings in C stdlib (including all str... functions) is bad. You just need to bring in your own string library.

bhk 8 minutes ago

Not just the string-related functions. If you want robust error checking, re-entrant code, and bounds checking performed in library functions (instead of performing bespoke validations all across your code base), you have some work to do. Yes, some improvements have been tacked on over the years, but many problems ("current locale", for one) remain endemic.
In my experience, the worst part of the C standard library is not its existence, but the fact that so many developers insist on slavishly using it directly, instead of safer wrappers.

voidUpdate 3 hours ago

Can't you just:

  for(int i = 0; i < len(characters); i++)
  {
    if(characters[i]-48 <= 9 && characters[i]-48 >= 0)
    {
      ret = ret * 10 + characters[i] - 48;
    }
    else
    {
      return ERROR;
    }
  }
  return ret;

Adjust until it actually works, but you get the picture.

knome 2 hours ago
this wouldn't catch overflow or underflow errors, nor does it allow non-base-10 numbers, nor does it handle negative numbers. and writing your own parser is a failure case by op's logic. they are complaining about the builtin parsing functions.
the author admits you can parse signed integers in their second example, but for unsigned, they don't seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it as an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.
I'm not sure what they mean by "output raw" vs "output"
  $ cat t.c
  #include <stdlib.h>
  #include <math.h>
  #include <stdio.h>
  int main(int argc, char * argv){
    char * enda = NULL;
    unsigned long long a = strtoull("-18446744073709551614", &enda, 10);
    printf("in = -18446744073709551614, out = %llu\n", a);
    char * endb = NULL;
    unsigned long long b = strtoull("-18446744073709551615", &endb, 10);
    printf("in = -18446744073709551615, out = %llu\n", b);
    return 0;
  }
  $ gcc t.c
  $ ./a.out
  in = -18446744073709551614, out = 2
  in = -18446744073709551615, out = 1
  $
I get their "output raw" value. I don't know what their "output" value is coming from.
I don't see anywhere they describe what they are representing in the raw vs not columns.
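The behaviors listed above (negative input wrapping, trailing data being ignored, range errors reported only through errno) can each be rejected by a thin wrapper around strtoull(). A minimal sketch, assuming the caller wants plain decimal digits only; the name parse_ull_strict is hypothetical.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns 1 on success, 0 on any rejected input. */
static int parse_ull_strict(const char *s, unsigned long long *out)
{
    char *end;
    unsigned long long v;

    /* Insist on a leading digit: this rejects "", whitespace, '+' and
       the '-' that strtoull() would otherwise negate and wrap. */
    if (!isdigit((unsigned char)s[0]))
        return 0;
    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno == ERANGE)    /* value exceeds ULLONG_MAX */
        return 0;
    if (*end != '\0')       /* trailing non-numeric data */
        return 0;
    *out = v;
    return 1;
}
```

With this wrapper the "-18446744073709551614" input from the session above is refused instead of coming back as 2.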
fhdkweig 1 hour ago
What if the number you want to return just happens to be the value of ERROR? You need an error flag that can't be represented as an int, but then C wouldn't let you return it from a function that only returns "int". It is why some languages throw exceptions and why databases have the special "null" value.
- voidUpdate 1 hour ago
  
  I don't use C enough to know what the convention is for throwing an error when the function can return a number anyway. You'd have to ask someone else
  
- jerf 1 hour ago
  
  And why some very, very special languages have an effectively-global variable called "errno" that you have to check after the call manually, and worry about whether maybe it was populated from some previous error. Nothing says "production-quality language that an entire civilization's code base should be based on" like "sometimes (but only sometimes!) functions return additional information through global values".
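One common C convention for the question raised in this subthread is to return a success flag and hand the value back through a pointer, so no integer has to double as an error sentinel. The sketch below also shows the errno handling mentioned above: clear it before the call, inspect it after. The name parse_int is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical wrapper: success or failure is the return value, so
   every int, including any would-be ERROR sentinel, is a valid result. */
static bool parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;                  /* clear any stale error from earlier calls */
    v = strtol(s, &end, 10);
    if (end == s)               /* no digits were consumed */
        return false;
    if (*end != '\0')           /* trailing garbage */
        return false;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return false;
    *out = (int)v;
    return true;
}
```

This sketch still accepts leading whitespace and a leading '+', because strtol() skips them before converting.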
  
Sharlin 2 hours ago
And how does this avoid returning nonsense if the number is too large? (Wrapping if the accumulator is unsigned, straight to UB land if signed.) Not reporting overflows as errors is one of the major problems demonstrated by TFA.
- voidUpdate 2 hours ago
  
  you could check if ret > ret * 10 + characters[i]-48, if so it has wrapped around and you return an error
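A small sketch of why comparing after the fact is not enough, assuming a uint32_t accumulator so that the wraparound is well defined (with a signed accumulator the overflow is already undefined behavior before the comparison runs). Both function names are hypothetical, and digit is assumed to be in 0..9.

```c
#include <stdbool.h>
#include <stdint.h>

/* The comparison suggested above: compute first, then look for a wrap. */
static bool wrap_check_flags(uint32_t ret, uint32_t digit)
{
    uint32_t next = ret * 10u + digit;   /* wraps modulo 2^32 */
    return ret > next;
}

/* Pre-check: true when ret * 10 + digit cannot fit in uint32_t. */
static bool precheck_flags(uint32_t ret, uint32_t digit)
{
    return ret > (UINT32_MAX - digit) / 10u;
}
```

Take ret = 1000000000 and the next digit 0: the true result 10000000000 wraps to 1410065408, which is larger than ret, so the after-the-fact comparison reports nothing while the pre-check flags the overflow.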
  
bitwize 2 hours ago

You cannot "just" anything in C without hitting a minefield of UB. It is, probably, more economical to convert your entire project to Rust than it is to do the pufferfish spine removal procedure of auditing the code base for UB and replacing the problem areas. With generative AI, the size of project for which this remains true may be as large as "the entire Linux kernel".

CodesInChaos 42 minutes ago

Another case many integer parsing functions get wrong is that they interpret a leading 0 as an octal indicator.

That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.

kevin_thibedeau 31 minutes ago

It used to be much more common. In the 70s there was a lot of collective hesitance to use hex with its strange letter digits. Octal was the compact representation of choice.
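In the C library, the leading-zero octal interpretation discussed in this subthread depends on the base argument passed to strtol(). A minimal sketch of the difference; the two wrapper names are hypothetical and error checking is omitted to keep the contrast visible.

```c
#include <stdlib.h>

/* Base 0 asks strtol() to infer the radix from the prefix:
   "0x" means hexadecimal and a leading "0" means octal. */
static long parse_auto_base(const char *s)
{
    return strtol(s, NULL, 0);
}

/* Base 10 treats leading zeros as ordinary decimal digits. */
static long parse_decimal(const char *s)
{
    return strtol(s, NULL, 10);
}
```

With these, "010" parses as 8 under base 0 and as 10 under base 10, so passing an explicit base of 10 is the opt-out.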

jervant 2 hours ago

https://man.openbsd.org/strtonum

bmandale 1 hour ago

Interestingly fails as well, in two ways. First:
> The string may begin with an arbitrary amount of whitespace (as determined by isspace(3))
Second is that it only applies to signed long long, not unsigned.

eithed 2 hours ago

Can't you use a regex to check that the given string contains only digits, and then use any of the provided functions? Then check that the returned value is a number to cater for edge cases.

OK, having a function to do that for you would be nice, but the post reads like the issue is that the standard library doesn't provide a function that behaves exactly as you want.
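A minimal sketch of the validate-then-convert idea above, assuming POSIX regex.h is available (it is not part of ISO C). The name parse_long_regex is hypothetical. Note that the pattern alone cannot rule out overflow, so the errno check after strtol() is still needed.

```c
#include <errno.h>
#include <regex.h>   /* POSIX, not ISO C */
#include <stdlib.h>

/* Hypothetical helper: returns 1 on success, 0 if the string is not a
   plain decimal integer or does not fit in a long. */
static int parse_long_regex(const char *s, long *out)
{
    regex_t re;
    long v;
    int matched;

    /* Optional minus sign followed by one or more ASCII digits. */
    if (regcomp(&re, "^-?[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    if (!matched)
        return 0;

    errno = 0;
    v = strtol(s, NULL, 10);
    if (errno == ERANGE)    /* too many digits to fit in a long */
        return 0;
    *out = v;
    return 1;
}
```

Compiling the pattern on every call is wasteful; a real wrapper would probably compile it once, or replace the regex with a simple loop over the characters.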

chadgpt3 2 hours ago

... say users of only language with no way to parse integers.

stephc_int13 3 hours ago

As a C programmer, I find this kind of bad faith article very irritating.

Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.

String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.

C is not the C standard library, ffs.

konmok 2 hours ago
I don't think it's in bad faith.
The distinction between a language and its standard library gets blurry even in theory, and in practice they're nearly inseparable. If a language's standard library has four ways of doing almost the same thing, and they're all fundamentally broken, that's a problem.
- stephc_int13 1 hour ago
  
  If you read the other articles by the same author on his blog, you'll see that he has some strong and weird opinions about C and UB.
  Complete BS in my opinion.
- dosisking 2 hours ago
  
  [flagged]
alexfoo 2 hours ago
Exactly. A wrapper that handles all of the edge cases properly and gives proper reporting just gets added to your own library of functions and the devs get used to using it. Much like the code for abstract data types like lists/hashmaps/etc which neither C nor the standard libraries provide.
Bonus points for having bespoke linting rules to point out the use of known “bad” functions.
In one old project we went through and replaced all instances of sprintf() with snprintf() or equivalent. Once we were happy that we’d got every occurrence we could then add lint rules to flag up any new use of sprintf() so that devs didn’t introduce new possible problems into the code.
(Obviously you can still introduce plenty of problems with snprintf() but we learned to give that more scrutiny.)
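One of the problems snprintf() still leaves open is silent truncation: it returns the length the full output would have needed, and callers often ignore that. A minimal sketch of a wrapper that checks it, with the hypothetical name format_long:

```c
#include <stdio.h>

/* Hypothetical wrapper: returns 1 only if the whole value fit in buf. */
static int format_long(char *buf, size_t size, long value)
{
    int n = snprintf(buf, size, "%ld", value);

    if (n < 0)                  /* output error */
        return 0;
    if ((size_t)n >= size)      /* output was truncated */
        return 0;
    return 1;
}
```

Formatting 123456 into a 4-byte buffer leaves the truncated string "123" behind, and the return value is the only signal that digits were lost.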
- 1718627440 2 hours ago
  
  > like lists/hashmaps/etc which neither C nor the standard libraries provide
  There is a hashmap implementation though: https://man7.org/linux/man-pages/man3/hsearch.3.html
  
wang_li 2 hours ago
The thing I find irritating is all the folks who say C is broken because it’s not a write-once, run-anywhere language like JavaScript or Python. Part of the deal has always been that the programmer needs to understand the target platform and the target compiler’s behavior.
- mswphd 2 minutes ago
  
  isn't the whole point of C that it's portable assembly though? needing to understand the target platform/compiler's behavior to write correct code seems to cut against that claim quite a bit.