
Comment by Nitramp

5 years ago

> The code in question did sscanf("%d", ...), which you read as “parse the digits at the start of the string into a number,” which is obviously linear.

I think part of the problem is that scanf has a very broad API and many features via its format string argument. I assume that's where the slowdown comes from here - scanf needs to implement a ton of features, some of which need the input length, and the implementor expected it to be run on short strings.

> The subtlety is that sscanf doesn’t do what you expect. I think that “don’t use library functions that don’t do what you expect” is impossible advice.

I don't know, at face value it seems reasonable to expect programmers to carefully check whether the library function they use does what they want it to do? How would you otherwise ever be sure what your program does?

There might be an issue that scanf doesn't document its performance well. But using a more appropriate and tighter function (atoi?) would have avoided the issue as well.
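For example, a minimal sketch (not the code from the story; it uses strtol rather than atoi so failures are detectable, and the helper name is made up) of parsing a leading integer without paying for the rest of the buffer:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a leading decimal integer without scanning the whole buffer.
   Illustrative sketch only; parse_long is a hypothetical helper name. */
static int parse_long(const char *s, long *out, const char **end)
{
    char *stop;
    errno = 0;
    long v = strtol(s, &stop, 10);
    if (stop == s || errno == ERANGE)
        return -1;                /* no digits, or out of range */
    *out = v;
    if (end)
        *end = stop;              /* caller can resume parsing here */
    return 0;
}

int main(void)
{
    const char *json = "12345,\"key\":...";   /* imagine a multi-megabyte buffer */
    long value;
    const char *rest;
    if (parse_long(json, &value, &rest) == 0)
        printf("parsed %ld, stopped at '%c'\n", value, *rest);
    return 0;
}
```

strtol stops at the first non-digit, so the cost is proportional to the number of digits consumed, not to the length of the buffer.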

Or, you know, don't implement your own parser. JSON is deceptively simple, but there's still enough subtlety to screw things up, qed.

But sscanf does do what they want it to do by parsing numbers. The problem is that it also calls strlen. I’m still not convinced that it’s realistically possible to have people very carefully understand the performance characteristics of every function they use.

Every programmer I know thinks about performance of functions either by thinking about what the function is doing and guessing linear/constant, or by knowing what the data structure is and guessing (eg if you know you’re doing some insert operation on a binary tree, guess that it’s logarithmic), or by knowing that the performance is subtle (eg “you would guess that this is log but it needs to update some data on every node so it’s linear”).

When you write your own library you can hopefully avoid having functions with subtle performance and make sure things are documented well (but then you also don’t think they should be writing their own library). When you use the C stdlib you’re a bit stuck. Maybe most of the functions there should just be banned from the codebase, but I would guess that would be hard.

> I assume that's where the slowdown comes from here - scanf needs to implement a ton of features, some of which need the input length, and the implementor expected it to be run on short strings.

I didn't get that impression. It sounded like the slowdown comes from the fact that someone expected sscanf to terminate when all directives were successfully matched, whereas it actually terminates when either (1) the input is exhausted; or (2) a directive fails. There is no expectation that you run sscanf on short strings; it works just as well on long ones. The expectation is that you're intentionally trying to read all of the input you have. (This expectation makes a little more sense for scanf than it does for sscanf.)

The scanf man page isn't very clear, but it looks to me like replacing `sscanf("%d", ...)` with `sscanf("%d\0", ...)` would solve the problem. "%d" will parse an integer and then dutifully read and discard the rest of the input. "%d\0" will parse an integer and immediately fail to match '\0', forcing a termination.

EDIT: on my xubuntu install, scanf("%d") does not clear STDIN when it's called, which conflicts with my interpretation here.

  • No it would not. Think about what the function would see as its format string in both cases.

    The root cause here isn't formatting or scanned items. It is C library implementations that implement the "s" versions of these functions by turning the input string into a nonce FILE object on every call, which requires an initial call to strlen() to find the end of the read buffer (see the sketch at the end of this thread). (C libraries do not have to work this way. Neither P.J. Plauger's Standard C library nor mine implements sscanf() this way. I haven't checked Borland's or Watcom's.)

    See https://news.ycombinator.com/item?id=24460852 .

    • Yes, it looks that way. On the Unix/Linux side of things, glibc also implements sscanf() by converting the string to a FILE* object, as does the OpenBSD implementation.

      It looks like this approach is taken by the majority of sscanf() implementations!

      I honestly would not personally have expected sscanf() to implicitly call strlen() on every call.
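For concreteness, here is a toy sketch of the strategy described above. It is not glibc's, OpenBSD's, or any other vendor's actual code, just the general shape: build a nonce stream over the string (fmemopen here), hand it to the stream-based scanner, and pay for a strlen() on every call to size that stream. A comment in main also notes why the `"%d\0"` idea upthread is a no-op.

```c
#define _POSIX_C_SOURCE 200809L   /* for fmemopen() */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Toy illustration only -- NOT any particular libc's implementation.
   The string variant wraps `src` in a temporary stream whose bounds
   must be known up front, so every call pays for strlen(src). */
static int toy_sscanf(const char *src, const char *fmt, ...)
{
    size_t len = strlen(src);                    /* O(strlen(src)) on every call */
    FILE *f = fmemopen((void *)src, len, "r");   /* nonce FILE over the string */
    if (f == NULL)
        return EOF;

    va_list ap;
    va_start(ap, fmt);
    int n = vfscanf(f, fmt, ap);                 /* reuse the stream-based scanner */
    va_end(ap);

    fclose(f);
    return n;                                    /* number of items converted */
}

int main(void)
{
    int x;
    /* "%d\0" is the very same C string as "%d" -- the literal ends at the
       embedded NUL -- so changing the format that way cannot help. */
    toy_sscanf("42 plus a very long tail that strlen still has to walk", "%d", &x);
    printf("%d\n", x);
    return 0;
}
```

Whether a libc does this or parses the string directly is exactly the implementation difference being pointed out above.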