Comment by Nitramp
5 years ago
- do not implement your own JSON parser (I mean, really?).
- if you do write a parser, do not use scanf (which is complex and subtle) for parsing, write a plain loop that dispatches on characters in a switch. But really, don't.
I think sscanf is subtle because when you think about what it does (for a given format string), it’s reasonably straightforward. The code in question did sscanf("%d", ...), which you read as “parse the digits at the start of the string into a number,” which is obviously linear. The subtlety is that sscanf doesn’t do what you expect. I think that “don’t use library functions that don’t do what you expect” is impossible advice.
I don’t use my own json parser but I nearly do. If this were some custom format rather than json and the parser still used sscanf, the bug would still happen. So I think json is somewhat orthogonal to the matter.
> The code in question did sscanf("%d", ...), which you read as “parse the digits at the start of the string into a number,” which is obviously linear.
I think part of the problem is that scanf has a very broad API and many features via its format string argument. I assume that's where the slowdown comes from here - scanf needs to implement a ton of features, some of which need the input length, and the implementor expected it to be run on short strings.
> The subtlety is that sscanf doesn’t do what you expect. I think that “don’t use library functions that don’t do what you expect” is impossible advice.
I don't know, at face value it seems reasonable to expect programmers to carefully check whether the library function they use does what they want it to do? How would you otherwise ever be sure what your program does?
There might be an issue that scanf doesn't document it's performance well. But using a more appropriate and tighter function (atoi?) would have avoided the issue as well.
Or, you know, don't implement your own parser. JSON is deceptively simple, but there's still enough subtlety to screw things up, qed.
But sscanf does do what they want it to do by parsing numbers. The problem is that it also calls strlen. I’m still not convinced that it’s realistically possible to have people very carefully understand the performance characteristics of every function they use.
Every programmer I know thinks about performance of functions either by thinking about what the function is doing and guessing linear/constant, or by knowing what the data structure is and guessing (eg if you know you’re doing some insert operation on a binary tree, guess that it’s logarithmic), or by knowing that the performance is subtle (eg “you would guess that this is log but it needs to update some data on every node so it’s linear”). When you write your own library you can hopefully avoid having functions with subtle performance and make sure things are documented well (but then you also don’t think they should be writing their own library). When you use the C stdlib you’re a bit stuck. Maybe most of the functions there should just be banned from the codebase, but I would guess that would be hard.
> I assume that's where the slowdown comes from here - scanf needs to implement a ton of features, some of which need the input length, and the implementor expected it to be run on short strings.
I didn't get that impression. It sounded like the slowdown comes from the fact that someone expected sscanf to terminate when all directives were successfully matched, whereas it actually terminates when either (1) the input is exhausted; or (2) a directive fails. There is no expectation that you run sscanf on short strings; it works just as well on long ones. The expectation is that you're intentionally trying to read all of the input you have. (This expectation makes a little more sense for scanf than it does for sscanf.)
The scanf man page isn't very clear, but it looks to me like replacing `sscanf("%d", ...)` with `sscanf("%d\0", ...)` would solve the problem. "%d" will parse an integer and then dutifully read and discard the rest of the input. "%d\0" will parse an integer and immediately fail to match '\0', forcing a termination.
EDIT: on my xubuntu install, scanf("%d") does not clear STDIN when it's called, which conflicts with my interpretation here.
2 replies →
> If this were some custom format rather than json and the parser still used sscanf, the bug would still happen. So I think json is somewhat orthogonal to the matter.
What's the point of using standard formats if you're not taking advantage of off-the-shelf software for handling it?
This is probably good advice but not even relevant. It's down one level from the real problem: when your game spends 6 minutes on a loading screen, *profile* the process first. You can't optimize what you haven't measured. Now, if you've identified that JSON parsing is slow, you can start worrying about how to fix that (which, obviously, should be "find and use a performant and well-tested library".)