Comment by tempay

10 months ago

It “wastes” a register even when you’re not actively using it. On x86 that can make a big difference, though with the added registers of x86_64 it is much less significant.

Wasting a register on more modern ISAs (PA-RISC 2.0, MIPS64, POWER, aarch64, etc. – they all have an abundance of general-purpose registers) is not a concern.

The actual «wastage» is in having to generate a prologue and an epilogue for each function – 2x instructions to preserve the old frame pointer and set the new one up, and 2x instructions at the point of return to restore the previous frame pointer.
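
As a rough illustration (a hypothetical one-liner; the exact instruction selection varies by compiler and flags), the overhead on x86_64 looks something like this:

    // add1.cpp – hypothetical leaf function, used only to show the overhead
    int add1(int x) { return x + 1; }

    // Roughly what a compiler emits at -O2 with -fno-omit-frame-pointer:
    //   push  rbp            ; save the caller's frame pointer
    //   mov   rbp, rsp       ; set up this function's frame
    //   lea   eax, [rdi+1]   ; the actual work
    //   pop   rbp            ; restore the caller's frame pointer
    //   ret
    //
    // Roughly what it emits with the default -fomit-frame-pointer:
    //   lea   eax, [rdi+1]
    //   ret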

Generally, it is not a big deal, with the exception of the pathological case of a very large number of very small functions calling each other frequently, where the extra 4x instructions per function fill up the L1 instruction cache «unnecessarily».

  • Those pathological cases are really what inlining is for, with the exception of any tiny recursive functions that can't be tail call optimised.

    • Yes, inlining (and LTO can take it a notch or two higher) does away with the problem altogether; however, the number of projects that default to «-Os» (or even to «-O2») to build a release product is substantial.

      There is also a significant number of projects that go to great lengths to force-override CFLAGS/CXXFLAGS (usually with «-O2 -g» or even with «-O») or make it extraordinarily difficult to change the project's default CFLAGS, for no apparent reason, which rules out a number of advanced optimisations in builds with the default settings.

It's not just the loss of an architectural register, it's also the added cost to the prologue/epilogue. Even on x86_64, it can make a difference, in particular for small functions, which might not be inlined for a variety of reasons.

  • If your small function is not getting inlined, you should investigate why that is instead of globally breaking performance analysis of your code.

    • A typical case would be C++ virtual member functions. (They can sometimes be devirtualized, or speculatively partially devirtualized, using LTO+PGO, but there are lots of legitimate cases where they cannot.)
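
      A contrived sketch of such a case (hypothetical names; the point is only that the call target isn't known statically, so the compiler cannot simply inline it):

        // Hypothetical interface: the callee is chosen at run time, so the
        // call through the vtable normally cannot be inlined without
        // (speculative) devirtualization.
        struct Shape {
            virtual ~Shape() = default;
            virtual double area() const = 0;
        };

        double total_area(const Shape* const* shapes, int n) {
            double sum = 0.0;
            for (int i = 0; i < n; ++i)
                sum += shapes[i]->area();   // indirect call via the vtable
            return sum;
        }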

  • CPUs spend an enormous amount of time waiting for IO and memory, and push/pop and similar are just insanely well optimized. As the article also claims, I would be very surprised to see any effect, unless the extra instructions themselves spill the I-cache.

    • I've seen around 1-3% on non-micro-benchmarks, i.e. real applications.

      See also this benchmark from Phoronix [0]:

        Of the 100 tests carried out for this article, when taking the geometric mean of all these benchmarks it equated to about a 14% performance penalty when adding -fno-omit-frame-pointer compared to -O2 alone.
      

      I'm not saying these benchmarks or the workloads I've seen are representative of the "real world", but people keep repeating that frame pointers are basically free, which is just not the case.

      [0] https://www.phoronix.com/review/fedora-frame-pointer

Right, but I was asking about functional problems (being "stuck"), which sounded like a big issue for the choice.

It caused a problem when building inline-assembly-heavy code that tried to use all the registers, the frame pointer register included.
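
A contrived sketch of the kind of code that collides with a reserved frame pointer (hypothetical snippet; the exact diagnostic varies – GCC and Clang typically refuse the clobber or warn that a reserved register is being claimed when frame pointers are enabled):

    // Hypothetical x86_64 example: the asm statement claims every
    // general-purpose register, rbp included.  With frame pointers enabled,
    // rbp is already reserved for the frame, so the compiler cannot honour
    // the clobber list and typically rejects the function.
    void use_all_gprs() {
        asm volatile("" ::: "rax", "rbx", "rcx", "rdx",
                            "rsi", "rdi", "rbp",
                            "r8",  "r9",  "r10", "r11",
                            "r12", "r13", "r14", "r15", "memory");
    }

    // Typically builds with -fomit-frame-pointer but is rejected with
    // -fno-omit-frame-pointer, because rbp cannot be given up.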