Comment by CGamesPlay

4 years ago

And maybe, in a decade or so, the man page for these functions will list their algorithmic complexity! That was the most interesting takeaway from this article, for me at least. I have only seen one or two libraries that actually list this in their documentation.

All of the C++ standard algorithms list complexity guarantees, I believe. It stunned me to learn from this saga that C doesn't seem to do this.

  • It's easy to forget that the original C standards were largely codifying existing practice during an era when using gets() [1] was existing practice. The world wasn't quite ready for Ada, I guess. Best-laid plans of mice and men, etc., etc.

    Also, keep an eye out for "amortized" complexity. This does have a legitimately rigorous definition, but for latency-bound paths it can practically amount to "O(whatever), except for the particular invocations that are far, far worse under unspecified conditions" (sketch below).

    [1] https://www.man7.org/linux/man-pages/man3/gets.3.html#BUGS
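    To make that concrete, here's a minimal sketch of the textbook example, a doubling dynamic array (illustrative only, not from the article): pushes are O(1) amortized, but the particular push that happens to trigger a reallocation copies every existing element and is O(n).

        #include <stdio.h>
        #include <stdlib.h>

        /* Growable int array: push is O(1) amortized, but the push that
         * triggers a doubling copies every existing element, so that one
         * call is O(n) -- the occasional "far, far worse" invocation. */
        typedef struct { int *data; size_t len, cap; } vec;

        static int vec_push(vec *v, int x) {
            if (v->len == v->cap) {
                size_t new_cap = v->cap ? 2 * v->cap : 8;
                int *p = realloc(v->data, new_cap * sizeof *p); /* O(n) copy may happen here */
                if (!p) return -1;
                v->data = p;
                v->cap = new_cap;
            }
            v->data[v->len++] = x;  /* the common O(1) path */
            return 0;
        }

        int main(void) {
            vec v = {0};
            for (int i = 0; i < 1000; i++)
                if (vec_push(&v, i) != 0) return 1;
            printf("%zu elements, capacity %zu\n", v.len, v.cap);
            free(v.data);
            return 0;
        }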

    • It's also easy to forget that C was competing mainly with assembly, while C++ competed with managed languages. The early C programmer ethos, especially among library authors, was much more along the lines of "look at the generated object code if you want to know what it's doing" while modern practice leans more towards "read the documentation for complexity guarantees". I'm not saying that worse documentation leads to better programmers, but I'm not not saying that either. Practices change, standards change.

      4 replies →

    • It makes me chuckle when hash maps are said to have O(1) insertion. That's true with respect to the number of items in the map, assuming the map doesn't need resizing and there isn't a hash collision... but it's generally not true with respect to the key length (see the sketch below). I think the worst case for most implementations is O(ln), where l is the length of the key and n is the number of items already inserted, assuming the hash function is O(l); the _amortised_ runtime would be O(l).

      28 replies →
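      A minimal sketch of where the key length sneaks in (FNV-1a used purely as an illustration): any hash of a string key has to walk the whole key, so even the collision-free, no-resize "O(1)" insert pays O(l) just to pick a bucket.

          #include <stdint.h>
          #include <stdio.h>

          /* FNV-1a over a NUL-terminated key: the loop visits every byte,
           * so just computing the bucket index is O(l) in the key length,
           * before collisions or resizing even enter the picture. */
          static uint64_t fnv1a(const char *key) {
              uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
              for (; *key; ++key) {
                  h ^= (unsigned char)*key;
                  h *= 0x100000001b3ULL;            /* FNV prime */
              }
              return h;
          }

          int main(void) {
              const char *key = "a rather long key that the hash must walk in full";
              size_t nbuckets = 1024;
              printf("bucket %zu\n", (size_t)(fnv1a(key) % nbuckets));
              return 0;
          }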

    • ...I fully plan to use "O(whatever)". Not sure for what.

      But, yes. Naive quicksort's average complexity being O(n log n), yet O(n^2) on already sorted data, is all I ever needed to learn that lesson (see the sketch below). When sorting already sorted data is slower than sorting randomized data, it's a quick realization that "amortized" or "average" cost = "read the fine print".

      12 replies →
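      For reference, a minimal sketch of that worst case: textbook quicksort with a Lomuto partition and the last element as pivot. On already sorted input the pivot is always the maximum, so every split is (n-1, 0) and the total work is 1 + 2 + ... + n = O(n^2), with recursion n levels deep.

          #include <stdio.h>

          static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

          /* Lomuto partition over a[lo..hi] (inclusive), last element as pivot. */
          static size_t partition(int *a, size_t lo, size_t hi) {
              int pivot = a[hi];
              size_t i = lo;
              for (size_t j = lo; j < hi; j++)
                  if (a[j] < pivot)
                      swap(&a[i++], &a[j]);
              swap(&a[i], &a[hi]);
              return i;
          }

          /* On sorted input the pivot is the max, so the split is always
           * (n-1, 0): quadratic total work, linear recursion depth. */
          static void quicksort(int *a, size_t lo, size_t hi) {
              if (lo >= hi) return;
              size_t p = partition(a, lo, hi);
              if (p > lo) quicksort(a, lo, p - 1);
              quicksort(a, p + 1, hi);
          }

          int main(void) {
              int a[] = {1, 2, 3, 4, 5, 6, 7, 8};  /* already sorted: the bad case */
              quicksort(a, 0, sizeof a / sizeof a[0] - 1);
              for (size_t k = 0; k < sizeof a / sizeof a[0]; k++) printf("%d ", a[k]);
              printf("\n");
              return 0;
          }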

  • Are there complexity guarantees for std::istream::operator>>?

  • That's because C is just a wrapper for machine code on a cheap PDP-11, and cheap PDP-11s didn't have enough RAM to do complexity.

The cppreference page linked by the blog post has been changed since: https://en.cppreference.com/w/cpp/io/c/fscanf#Notes

> Note that some implementations of sscanf involve a call to strlen, which makes their runtime linear on the length of the entire string. This means that if sscanf is called in a loop to repeatedly parse values from the front of a string, your code might run in quadratic time
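To make the quadratic pattern concrete, here's a hedged sketch (illustrative, not from the post): the first loop is the shape the note warns about, since each sscanf call may strlen() the entire remaining buffer; the second uses strtod(), which parses from the current position and reports where it stopped, so the buffer is walked only once.

    #include <stdio.h>
    #include <stdlib.h>

    /* The shape the note warns about: each sscanf call may strlen() the
     * whole remaining buffer, so pulling k numbers off an n-byte string
     * can cost O(k * n). */
    static size_t parse_with_sscanf(const char *buf, double *out, size_t max) {
        size_t count = 0;
        int used = 0;
        double v;
        while (count < max && sscanf(buf, "%lf%n", &v, &used) == 1) {
            out[count++] = v;
            buf += used;   /* advancing the pointer does not avoid the strlen */
        }
        return count;
    }

    /* One workaround: strtod() parses from the current position and
     * reports where it stopped, so the buffer is walked only once. */
    static size_t parse_with_strtod(const char *buf, double *out, size_t max) {
        size_t count = 0;
        char *end;
        while (count < max) {
            double v = strtod(buf, &end);
            if (end == buf) break;   /* no more numbers */
            out[count++] = v;
            buf = end;
        }
        return count;
    }

    int main(void) {
        const char *numbers = "1.5 2.5 3.5 4.5";
        double a[8], b[8];
        printf("sscanf: %zu values, strtod: %zu values\n",
               parse_with_sscanf(numbers, a, 8),
               parse_with_strtod(numbers, b, 8));
        return 0;
    }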

  • Good. I'm so happy they put it there. It's a little thing, but such little things - documenting corner cases - can have great benefits.

    I have a bad memory for all but the most frequently used standard library calls, so I regularly end up refreshing my memory from cppreference.com, and I tend to instinctively scan any notes/remarks sections, as there's often critical information there. So now I can be sure I'll be reminded of this the next time I need to use the scanf family.

I don't know whether it's required to, but there doesn't really seem to be an upper bound on what glibc's scanf will eat for a %f (e.g. a gigabyte of zeroes followed by "1.5" will still be parsed as 1.5), so for that implementation there certainly isn't a trivial upper bound on the amount of input read and processed for %f, like you might expect there to be.
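A hedged sketch of that behaviour at a smaller scale (a million zeroes rather than a gigabyte): the leading zeroes are simply part of one decimal number, so a single %f conversion consumes all of them.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* "000...0001.5": the leading zeroes are part of one decimal
         * number, so a single %f conversion consumes all of them and
         * still yields 1.5 -- scaled down here to a million zeroes. */
        size_t zeros = 1000000;
        char *s = malloc(zeros + 4);
        if (!s) return 1;
        memset(s, '0', zeros);
        strcpy(s + zeros, "1.5");

        float f = 0.0f;
        if (sscanf(s, "%f", &f) == 1)
            printf("parsed: %g\n", f);   /* prints 1.5 */

        free(s);
        return 0;
    }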

Yet another reason to not stringify floats. Just use hexfloats (but beware of C++ standard bugs involving >>) or binary.
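For the text route, a minimal sketch of the hexfloat round trip: %a prints an exact base-2 representation and strtod() reads it back bit-for-bit, so nothing is lost the way it can be with too few %f/%g digits (the caveat about C++ operator>> and hexfloats applies on the reading side there).

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        double x = 0.1;   /* not exactly representable in binary */
        char buf[64];

        /* %a prints an exact hexadecimal floating-point literal, e.g.
         * 0x1.999999999999ap-4, so the text carries every bit. */
        snprintf(buf, sizeof buf, "%a", x);

        double y = strtod(buf, NULL);   /* strtod reads hexfloats too */
        printf("%s round-trips exactly: %s\n", buf, x == y ? "yes" : "no");
        return 0;
    }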

Unfortunately "gigabytes of numerical data, but formatted as a text file" is commonplace. For some reason HDF5 is far less popular than it ought to be.

  • But why do strlen() at all? And why do all platforms (Linux, Windows, macOS) seem to do that?

    I think you're right that there is no upper bound, but it shouldn't be necessary to do a full strlen() if you instead scan incrementally: you could go char by char until the '%f' pattern is fulfilled and then return. That would solve the issue at its root -- and who knows how many programs would suddenly get faster...

    So, looking at the glibc sources, I've found the culprit in an abstraction. It looks like a FILE*-like string-buffer object is created around the C string: https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-commo...

    And that abstraction does call something similar to strlen() when initializing, to know its bounds, here: https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/strop...

    I'm reading this source for the first time, but I guess that, to avoid breaking anything, one could introduce a new type of FILE* string buffer, let's say in 'strops_incr.c', that works incrementally, reading one char at a time from the underlying string and skipping the strlen()... (rough sketch of the idea below)

    It would be awesomely cool if GTA Online loaded faster under Wine than on Windows :-)
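    Roughly what such an incremental wrapper could look like, sketched here with glibc's fopencookie() extension purely as an illustration of the idea (this is not what glibc does, and 'strops_incr.c' above is hypothetical): the read callback hands out bytes on demand and stops at the NUL terminator, so nothing ever scans ahead of the current position, and repeated fscanf on the wrapped string touches each byte only once.

        #define _GNU_SOURCE              /* fopencookie is a GNU extension */
        #include <stdio.h>
        #include <sys/types.h>

        /* Wrap a C string in a FILE* that is refilled on demand and stops
         * at the NUL terminator -- no up-front strlen() over the string. */
        struct strcookie { const char *pos; };

        static ssize_t str_read(void *cookie, char *buf, size_t size) {
            struct strcookie *c = cookie;
            size_t n = 0;
            while (n < size && c->pos[n] != '\0') {  /* copy until NUL or buffer full */
                buf[n] = c->pos[n];
                n++;
            }
            c->pos += n;
            return (ssize_t)n;                       /* 0 means end of "file" */
        }

        static FILE *fopen_cstring(const char *s, struct strcookie *c) {
            c->pos = s;
            cookie_io_functions_t io = { .read = str_read };
            return fopencookie(c, "r", io);
        }

        int main(void) {
            struct strcookie c;
            FILE *fp = fopen_cstring("3.25 7.5 and some trailing text", &c);
            float a, b;
            if (fp && fscanf(fp, "%f %f", &a, &b) == 2)
                printf("%g %g\n", a, b);             /* 3.25 7.5 */
            if (fp) fclose(fp);
            return 0;
        }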

How would it help you to know that it's O(n)? It needs to read all the characters of the float. The problem is that it needlessly reads characters even after the float.