Comment by TeMPOraL

4 years ago

I personally would, if it were listed in the documentation. Doing stuff and profiling later is the right general approach to performance optimization. But what's better is not making stupid mistakes in the first place, if they are trivial to avoid. To achieve that, you need to know the complexity guarantees of functions and data structures - or at least their ballpark (like, "this could be O(n) or perhaps O(n log n), definitely not worse").

This is where setting the guarantees and documenting them is useful - it allows people to trivially avoid making these performance mistakes. Prevention is better than cure, in that - as the GTA Online case demonstrates - in the later stages of product development, people may not bother fixing performance anymore.

It might still not help in the case of sscanf. The documentation would specify that it's O(N) in the size of the input string, which is just what one would expect without deeper thought. The problem is not the O(N); the problem is that N is the length of the complete input string, not just the part being parsed. The documentation would have to include a big fat warning about that.
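
A minimal sketch of the trap (hypothetical code; in common C library implementations, sscanf wraps its string argument in a FILE-like stream, which involves a strlen over the whole remaining buffer):

    #include <stdio.h>

    /* Parse whitespace-separated integers out of one big buffer.
     * If each sscanf call is O(length of the remaining buffer)
     * rather than O(characters actually parsed), then parsing k
     * fields out of an n-byte buffer costs O(k*n): quadratic
     * when k grows with n. */
    void parse_all(const char *buf) {
        int value, consumed;
        while (sscanf(buf, "%d%n", &value, &consumed) == 1) {
            /* ... use value ... */
            buf += consumed;  /* advance past the parsed field */
        }
    }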

Part of the issue in my mind is that big O complexity values don't (can't?) tell you the point at which the cost actually starts to hurt, if they are nonlinear. sscanf could be O(N^10) (does the rule about ignoring the coefficient also apply to the exponent?), but if it only starts to hurt your application at the point of 10 TB string reads, then it's still unimportant.
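
(A toy worked example, with invented constants: big O drops constant factors but not exponents, yet the constants still decide where the pain starts. Suppose one routine costs 1e-9 * n^2 seconds and another costs 1e-3 * n * log2(n) seconds. They break even where 1e-9 * n = 1e-3 * log2(n), i.e. around n ≈ 2*10^7; below that input size, the asymptotically "worse" quadratic routine is actually faster.)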

I do agree that people often take "don't optimise early" as licence to not make up-front optimisations that will clearly be needed (for example, implementing a text editor's buffer as a rope), but I don't think this is such a case, unless you test the function with realistic future file sizes (something you should be doing anyway) and sscanf profiles as a bottleneck.

  • I agree. And I'd personally probably trip over it too, if I used C functions much (I use the C++ standard library equivalents, and I'm going to recheck the string parsing code in our project now, because we may have the same problem, just behind a different API!). strlen() in a loop is something that can sneak up on you by virtue of too many layers of abstraction - the scanf() family being one of them.

    But what I'm saying is, documenting big O complexity is useful, even if imperfect (and if the function has "tricky" complexity, the conditions where it gets bad should be documented too!).

    > sscanf could be O(N^10) (does the rule about ignoring the coefficient also apply to the exponent?), but if it only starts to hurt your application at the point of 10 TB string reads, then it's still unimportant.

    Sure, but then applications grow and functions get reused. If it's trivial to spot the O(N^10) sscanf and replace it with an O(N) alternative (a sketch of such a replacement follows below), I'd want to know the complexity and do the replacement immediately - otherwise, you may discover two years later that the company is investing employee-years into horizontally scaling your application because it's slow under workload, when fixing that one thing would let it run the same workload on a laptop, on a single core.

    (This is not a joke, there are "big data" stories like this.)
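
    A minimal sketch of what such a replacement can look like (hypothetical code, assuming the hot loop parses numbers out of one big buffer):

        #include <stdlib.h>

        /* strtol advances an end pointer itself and touches only
         * the characters it consumes, so scanning the whole buffer
         * is O(n) overall: no hidden strlen on every call. */
        void parse_all_linear(const char *buf) {
            char *end;
            for (;;) {
                long value = strtol(buf, &end, 10);
                if (end == buf)   /* nothing consumed: stop */
                    break;
                /* ... use value ... */
                buf = end;        /* resume right after the field */
            }
        }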

    • > But what I'm saying is, documenting big O complexity is useful, even if imperfect (and if the function has "tricky" complexity, the conditions where it gets bad should be documented too!).

      Ah yes, I totally agree on this point: it should be documented, for those who are more cautious/mathematical in their engineering efforts. I'm just not one of those people! I prefer to focus on the structural behaviour of a team (when it's my responsibility) to ensure that cases like:

      > two years later that the company is investing employee-years into horizontally scaling your application because it's slow under workload

      resolve correctly. If a team is set up to respond healthily to production slowdowns (of course the equation is different for software with a slow or nonexistent update loop), then IMO you don't need to invest as heavily in up-front optimisation and can instead invest in features, letting profiling in production (sample-based, for busy paths) tell you when optimisation is warranted.

      At the end of the day, up-front optimisation is an exchange of time with your future self and colleagues. It's worth it in some cases and not in others - but knowing where the balance of that tradeoff lies in your circumstances is a good chunk of all engineering decisions!

  • The important part of the documentation would be that N is the length of the entire string, not just the subset of data that needs to be processed. The actual complexity isn't relevant in this case.
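
    A hypothetical example of what such a warning could look like in a man page or header comment:

        /* NOTE (hypothetical wording): each call runs in
         * O(strlen(s)) time, where s is the ENTIRE input string,
         * not just the prefix matched by the format. Calling this
         * in a loop over a large buffer is therefore quadratic. */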