Comment by charleshn
10 months ago
It's not just the loss of an architectural register, it's also the added cost to the prologue/epilogue. Even on x86_64, it can make a difference, in particular for small functions, which might not be inlined for a variety of reasons.
If your small function is not getting inlined, you should investigate why that is instead of globally breaking performance analysis of your code.
A typical case would be C++ virtual member functions. (They can sometimes be devirtualized, or speculatively partially devirtualized, using LTO+PGO, but there are lots of legitimate cases where they cannot.)
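To make that case concrete, here is a minimal sketch. The `Shape`, `Square`, `Rect` and `total_area` names are made up for illustration and do not come from any real codebase. The dynamic type is only known at run time, so the compiler generally cannot inline `area()` at the call site, and each call pays the full prologue/epilogue of a function whose body is a single multiply.

```cpp
// Hypothetical example: all names here are illustrative assumptions.
#include <memory>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    // Tiny body. When built with -fno-omit-frame-pointer on x86_64, each
    // override typically gains `push rbp; mov rbp, rsp` ... `pop rbp`
    // around what is otherwise a one-instruction computation.
    virtual long area() const = 0;
};

struct Square final : Shape {
    long side;
    explicit Square(long s) : side(s) {}
    long area() const override { return side * side; }
};

struct Rect final : Shape {
    long w, h;
    Rect(long w_, long h_) : w(w_), h(h_) {}
    long area() const override { return w * h; }
};

// The container mixes dynamic types, so s->area() is an indirect call
// through the vtable and is usually not inlined without LTO+PGO.
long total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
    long sum = 0;
    for (const auto& s : shapes) {
        sum += s->area();
    }
    return sum;
}
```

One way to see the difference is to compare the assembly from `g++ -O2 -S -fno-omit-frame-pointer` with that from `g++ -O2 -S -fomit-frame-pointer`. The exact output depends on the compiler and version.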
CPUs spend an enormous amount of time waiting for IO and memory, and push/pop and similar are just insanely well optimized. As the article also claims, I would be very surprised to see any effect, unless the extra instructions themselves spilled the I-cache.
I've seen around 1-3% overhead on non-micro benchmarks, i.e. real applications.
See also this benchmark from Phoronix [0].
I'm not saying these benchmarks or the workloads I've seen are representative of the "real world", but people keep repeating that frame pointers are basically free, which is just not the case.
[0] https://www.phoronix.com/review/fedora-frame-pointer