Comment by tempay

10 months ago

It “wastes” a register even when you’re not actively using it. On x86 that can make a big difference, though with the added registers of x86_64 it is much less significant.

Wasting a register on more modern ISAs (PA-RISC 2.0, MIPS64, POWER, aarch64, etc. – they all have an abundance of general-purpose registers) is not a concern.

The actual «wastage» is in having to generate a prologue and an epilogue for each function – 2x instructions to preserve the old frame pointer and set the new one up, and 2x instructions at the point of return to restore the previous frame pointer.
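
As a rough illustration (a hypothetical one-liner; the exact instruction selection varies by compiler and flags), the overhead on x86_64 looks something like this:

    // add1.cpp – hypothetical leaf function, used only to show the overhead
    int add1(int x) { return x + 1; }

    // Roughly what a compiler emits at -O2 with -fno-omit-frame-pointer:
    //   push  rbp            ; save the caller's frame pointer
    //   mov   rbp, rsp       ; set up this function's frame
    //   lea   eax, [rdi+1]   ; the actual work
    //   pop   rbp            ; restore the caller's frame pointer
    //   ret
    //
    // Roughly what it emits with the default -fomit-frame-pointer:
    //   lea   eax, [rdi+1]
    //   ret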

Generally, it is not a big deal, with the exception of the pathological case of a very large number of very small functions calling each other frequently, where the extra 4x instructions per function fill up the L1 instruction cache «unnecessarily».

  • Those pathological cases are really what inlining is for, with the exception of any tiny recursive functions that can't be tail call optimised.

    • Yes, inlining (and LTO can take it a notch or two higher) does away with the problem altogether; however, the number of projects that default to «-Os» (or even to «-O2») to build a release product is substantial.

      There is also a significant number of projects that go to great lengths to force-override CFLAGS/CXXFLAGS (usually with «-O2 -g» or even with «-O») or make it extraordinarily difficult to change the project's default CFLAGS, for no apparent reason, which rules out a number of advanced optimisations in builds with the default settings.

It's not just the loss of an architectural register, it's also the added cost to the prologue/epilogue. Even on x86_64, it can make a difference, in particular for small functions, which might not be inlined for a variety of reasons.

  • If your small function is not getting inlined, you should investigate why that is instead of globally breaking performance analysis of your code.

    • A typical case would be C++ virtual member functions. (They can sometimes be devirtualized, or speculatively partially devirtualized, using LTO+PGO, but there are lots of legitimate cases where they cannot.)
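
      A contrived sketch of such a case (hypothetical names; the point is only that the call target isn't known statically, so the compiler cannot simply inline it):

        // Hypothetical interface: the callee is chosen at run time, so the
        // call through the vtable normally cannot be inlined without
        // (speculative) devirtualization.
        struct Shape {
            virtual ~Shape() = default;
            virtual double area() const = 0;
        };

        double total_area(const Shape* const* shapes, int n) {
            double sum = 0.0;
            for (int i = 0; i < n; ++i)
                sum += shapes[i]->area();   // indirect call via the vtable
            return sum;
        }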

  • CPUs spend an enormous amount of time waiting for IO and memory, and push/pop and similar are just insanely well optimized. As the article also claims, I would be very surprised to see any effect, unless the extra instructions themselves spill the I-cache.

    • I've seen around 1-3% on non-micro-benchmarks, i.e. real applications.

      See also this benchmark from Phoronix [0]:

        Of the 100 tests carried out for this article, when taking the geometric mean of all these benchmarks it equated to about a 14% performance penalty when adding -fno-omit-frame-pointer compared to -O2 alone.
      

      I'm not saying these benchmarks or the workloads I've seen are representative of the "real world", but people keep repeating that frame pointers are basically free, which is just not the case.

      [0] https://www.phoronix.com/review/fedora-frame-pointer

Right, but I was asking about functional problems (being "stuck"), which sounded like a big issue for the choice.

It caused a problem when building inline-assembly-heavy code that tried to use all the registers, the frame pointer register included.
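
A contrived sketch of the kind of code that collides with a reserved frame pointer (hypothetical snippet; the exact diagnostic varies – GCC and Clang typically refuse the clobber or warn that a reserved register is being claimed when frame pointers are enabled):

    // Hypothetical x86_64 example: the asm statement claims every
    // general-purpose register, rbp included.  With frame pointers enabled,
    // rbp is already reserved for the frame, so the compiler cannot honour
    // the clobber list and typically rejects the function.
    void use_all_gprs() {
        asm volatile("" ::: "rax", "rbx", "rcx", "rdx",
                            "rsi", "rdi", "rbp",
                            "r8",  "r9",  "r10", "r11",
                            "r12", "r13", "r14", "r15", "memory");
    }

    // Typically builds with -fomit-frame-pointer but is rejected with
    // -fno-omit-frame-pointer, because rbp cannot be given up.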