Comment by londons_explore

5 years ago

This article suggests a big reason to use this approach is to separate hot and cold code.

I assume that's for good use of the CPU's instruction and microcode caches.

Yet in a protobuf parser, I'm surprised there is enough code to fill said caches, even if you put the hot and cold code together. Protobuf just isn't that complicated!

Am I wrong?

> I assume that's for good use of the CPU's instruction and microcode caches.

I don't think that is the reason. These are microbenchmark results, where realistically all the code will be hot in caches anyway.

The problem is that a compiler optimizes an entire function as a whole. If you have slow paths in the same function as fast paths, it can cause the fast paths to get worse code, even if the slow paths are never executed!

You might hope that using __builtin_expect(), a.k.a. the LIKELY()/UNLIKELY() macros, on the if statements would help. They do help somewhat, but not as much as putting the slow paths in separate functions entirely.

In particular, the author is talking about CPU registers being spilled to memory, and the need to set up and tear down stack frames. The compiler can only eliminate those costs for extremely simple functions, and error-handling code often isn't simple.

  • If you really care about performance, there’s no particular reason the whole stack frame has to be set up at function entry and torn down at exit. Compilers are not flexible enough about this, or about other things, like how they won’t store values in the frame pointer register on i386.

    I believe the optimization to do this is called “shrink-wrapping”.