Comment by decasia

4 years ago

It would be nice if it were more common for standard library functions to include algorithmic complexity as part of the standard documentation.

Absent that, we can of course read the source code and find out, but for the most part we tend to operate on an informed assumption about what the algorithmic complexity of a given operation is likely to be. Inevitably, sometimes that assumption is incorrect.

There's no way to develop software without making assumptions, some of which inevitably turn out to be incorrect, so I don't think there is any great shame in having that happen, in itself. But better docs could help us make better assumptions with less effort, at least.

Keeping track of algorithmic complexity would be nice as a language and/or static analysis feature. If you wanted to be exact, or to do it for a language with complex metaprogramming, I assume it would be a nightmare to implement. Absent those complications, and especially if you always reduced the result to O(1), O(n), O(log n), etc., it might not even be that difficult, and the potential advantages would be significant.

  • The difficulty here is "define n". And I don't mean that facetiously. You have a string parsing lib. It is, for reasons, quadratic over the number of strings parsed, and linear per string.

    This is overall n^3, but that's meaningless because there isn't actually just one n. It's really m^2 * n. That means you can't reduce it to a single figure, because you want to keep both components (because, say, you know it will only ever be called with a single string).

    But then, in the next app, this gets called and reinitialized once per file. And the routine handling files, for reasons beyond our ken, is O(k log k) in the number of files. We're now at k * log(k) * m^2 * n.

    And so, over any sufficiently long call chain, "what is n" is the overriding question - string length, number of strings, number of files? Not "how complex is the algorithm", because you want to optimize for what's relevant to your use case.
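
    As a hypothetical illustration of the "more than one n" point (parse_batch and its behaviour are invented for this sketch, not taken from any real library): checking each new string against all previously seen ones is quadratic in the number of strings m, and each comparison is linear in the string length n, so the cost is roughly m^2 * n rather than any single-variable bound.

      /* Sketch only: two independent size parameters, m strings of length ~n. */
      #include <stddef.h>
      #include <string.h>

      size_t parse_batch(const char **strings, size_t m)
      {
          size_t chars_parsed = 0;
          for (size_t i = 0; i < m; i++) {
              /* Quadratic part: compare against every earlier string.
               * Each strcmp is itself linear in the string length n,
               * so this loop alone is O(m^2 * n). */
              int duplicate = 0;
              for (size_t j = 0; j < i; j++) {
                  if (strcmp(strings[i], strings[j]) == 0) {
                      duplicate = 1;
                      break;
                  }
              }
              if (duplicate)
                  continue;
              /* Linear-per-string part: walk the characters once. */
              for (const char *p = strings[i]; *p; p++)
                  chars_parsed++;
          }
          return chars_parsed;
      }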

    • It would be a huge step to simply follow the call tree and report the depth of nested loops for each branch. You could then check what N is at each level.

      The trick is knowing where the nested loops are since they can be spread across functions.

      I had a function that scaled as N^2, but it was also creating a list of that size. Then it called a function to remove duplicates from that list. That function was quadratic in the length of its input, which was itself N^2 items, so the whole thing was actually N^4. And now that I think of it, those loops were not even nested... I rewrote the first part to not create duplicates and deleted the quadratic deduplication. Now it's N^2, but it has to be.
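
      A hedged sketch of that shape (all names invented for the example): build_pairs() produces a list whose length is N^2, and dedupe() is quadratic in the length of whatever list it is handed, so the combination is (N^2)^2 = N^4 even though no single function contains four nested loops.

        #include <stdlib.h>

        typedef struct { int a, b; } pair_t;

        /* O(N^2): emits every ordered pair (i, j), so the output has N^2 entries. */
        static pair_t *build_pairs(int n, size_t *out_len)
        {
            pair_t *list = malloc((size_t)n * (size_t)n * sizeof *list);
            if (!list)
                return NULL;
            size_t k = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    list[k++] = (pair_t){ i, j };
            *out_len = k;
            return list;
        }

        /* O(len^2): naive in-place deduplication of the list it is given. */
        static size_t dedupe(pair_t *list, size_t len)
        {
            size_t kept = 0;
            for (size_t i = 0; i < len; i++) {
                int seen = 0;
                for (size_t j = 0; j < kept; j++) {
                    if (list[j].a == list[i].a && list[j].b == list[i].b) {
                        seen = 1;
                        break;
                    }
                }
                if (!seen)
                    list[kept++] = list[i];
            }
            return kept;
        }

        /* dedupe(build_pairs(n)) costs O((n^2)^2) = O(n^4) in n,
         * with the cost split across two innocent-looking functions. */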

    • I guess you're right. You'd have to keep track of all of it for the information to be meaningful. Still seems doable to me, assuming the functions are pure.

      Here's another crazy idea: keeping track of this while taking into consideration aggressive compiler optimizations.

  • There is no general solution to deciding whether a program is O(n^k).[1] So your static analysis will either not know the answer for some programs, report a wrong bound, or report a ridiculous overestimate.

    [1] https://cstheory.stackexchange.com/questions/5004/are-runtim...

    • So? Static analysis doesn't need to always produce an answer; it only needs to produce one most of the time. The question isn't whether you can do it in general for all inputs (that's not possible for basically anything you would want to know), it's whether you can do it often enough on the kind of code people actually write.

Personally, I wouldn't even bother to check the algorithmic complexity of every external function I call. I'd just use the logical choice (like sscanf) and only consider optimising if things started to slow down and profiling the application highlighted it as a bottleneck.

  • I personally would, if it were listed in the documentation. Doing stuff and profiling later is the right general approach to performance optimization. But what's better is not making stupid mistakes in the first place, if they are trivial to avoid. To achieve that, you need to know the complexity guarantees of functions and data structures - or at least their ballpark (like, "this could be O(n) or perhaps O(n log n), definitely not worse").

    This is where setting the guarantees and documenting them is useful - it allows people to trivially avoid making these performance mistakes. Prevention is better than cure here: as the GTA Online case demonstrates, in the later stages of product development people may not bother fixing performance anymore.

    • It might still not help in the case of sscanf. The documentation would specify that it's O(N) in the size of the input string, just what one would expect without deeper thought. The problem is not the O(N); the problem is that N is the length of the complete input string, not just the part being parsed. The documentation would have to include a big fat warning about that.
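
      A sketch of that failure mode (the function names here are made up; the claim about sscanf holds for at least some widely used C library implementations, which effectively measure the whole remaining string before parsing): the loop below looks linear, one sscanf call per number, but if each call first scans everything after p, then parsing a buffer of length L takes roughly L/k calls of O(L) each, i.e. O(L^2) overall.

        #include <stdio.h>
        #include <stdlib.h>

        /* Looks linear, but may be quadratic in the buffer length if each
         * sscanf call internally walks the entire remaining string. */
        static size_t parse_floats_sscanf(const char *buf, float *out, size_t max)
        {
            const char *p = buf;
            size_t count = 0;
            int consumed = 0;
            /* %n reports how many characters this call consumed, so p can advance. */
            while (count < max && sscanf(p, " %f%n", &out[count], &consumed) == 1) {
                p += consumed;
                count++;
            }
            return count;
        }

        /* One workaround: strtof() only examines the characters it needs,
         * so the same loop stays linear in the buffer length. */
        static size_t parse_floats_strtof(const char *buf, float *out, size_t max)
        {
            const char *p = buf;
            char *end;
            size_t count = 0;
            while (count < max) {
                float v = strtof(p, &end);
                if (end == p)   /* no further numbers */
                    break;
                out[count++] = v;
                p = end;
            }
            return count;
        }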

    • Part of the issue in my mind is that big-O complexity values don't (can't?) tell you the point at which they start to hurt, if they are nonlinear. sscanf could be O(N^10) (does the rule about ignoring the coefficient also apply to the exponent?), but if it only starts to hurt your application at the point of 10 TB string reads then it's still unimportant.

      I do agree that people often take "don't optimise early" as licence not to make optimisations up front that will clearly be needed (for example, implementing the buffer as a rope in a text editor), but I don't think that's the case here unless you test the function with reasonable future file sizes (something you should be doing anyway) and sscanf profiles as a bottleneck.

  • Yes, I absolutely think profiling and then only optimizing the actual problems is always a sound choice.

    I don't check the docs for every library function I use. I'm just saying, it wouldn't hurt if, when you do read the docs for standard library functions, the algorithmic complexity was mentioned in passing.

    • In principle, that sounds good. But it can happen that you profiled when N was 1,000 and it seemed fine; then a few years later (as with GTA), N has grown to 63,000 and it no longer is. It seems unlikely the developer will go back and profile it again. Similarly, I think the original Windows Update algorithm for figuring out which updates you needed started out fine, but it turned out to be quadratic, and 20 years later there are tens of thousands of updates, so it became almost impossible to install XP SP2 from a CD and have it update itself.

  • But you'll lose so much time doing that! Realizing there's a bug and investigating it is a huge amount of work compared to never writing it in the first place.

There are many implementations of the C standard library, and they may not all have the same complexity. In this case the code has the right complexity (linear), but for the wrong n (string length instead of parse length).

> It would be nice if it were more common for standard library functions to include algorithmic complexity as part of the standard documentation.

Isn’t the point of standard library functions to define the contract and not implementation constraints? In other words, if algorithmic complexity is a significant concern, perhaps a standard library function is not a suitable choice.

  • Algorithmic complexity is not an implementation detail, it's part of the constraints on how you can use the function.

This is a cool idea. I'm sure it's been done automatically in code before, but I'd love to see it in my editor. It sounds similar to the bundle-size extension that annotates how big a package you're importing is in Node.

> standard library functions to include algorithmic complexity as part of the standard documentation.

It would be acceptable to have a graph of input size vs. time, and this graph could be autogenerated using a testbed that is also used for unit testing. Two birds, one stone!
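
A minimal sketch of that idea, with function_under_test standing in for whatever routine the unit tests already exercise (the names here are invented): time the routine at a range of input sizes and emit CSV that a plotting tool can turn into the size-vs-time graph.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Stand-in for the code the existing test suite exercises. */
    static void function_under_test(const char *input, size_t len)
    {
        volatile size_t sink = 0;
        for (size_t i = 0; i < len; i++)
            sink += (size_t)input[i];
        (void)sink;
    }

    int main(void)
    {
        printf("n,cpu_seconds\n");
        for (size_t n = 1000; n <= 1000000; n *= 10) {
            char *input = malloc(n);
            if (!input)
                return 1;
            memset(input, 'x', n);

            clock_t start = clock();
            function_under_test(input, n);
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

            printf("%zu,%f\n", n, secs);
            free(input);
        }
        return 0;
    }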

  • I don't think we could come up with standard x-axis bounds for such a graph, since n = 1,000 or even 1,000,000 may not be zoomed out far enough to showcase the behavior approaching infinity.