The return of the frame pointers

10 months ago (brendangregg.com)

I remember when the omission of stack frame pointers started spreading at the beginning of the 2000s. I was in college at the time, studying computer science in a very poor third-world country. Our computers were old and far from powerful, so for most course projects we would eschew interpreters and use compilers. Mind you, what my college lacked in money it compensated for with interesting coursework. We studied and implemented low-level data structures, compilers, assembly-code numerical routines and even a device driver for Minix.

During my first two years in college, if one of our programs did something funny, I would attach gdb and see what was happening at the assembly level. I got used to "walking the stack" manually, though the debugger often helped a lot. Happy times, until all of a sudden "-fomit-frame-pointer" was all the rage and stack traces stopped making sense. Just like that, debugging that segfault or illegal instruction became exponentially harder. A short time later, I started using Python for almost everything to avoid broken debugging sessions. So, I lost an order of magnitude or two of performance to "-fomit-frame-pointer". But learning Python served me well for other adventures.

I'm glad he mentioned Fedora because it's been a tiresome battle to keep frame pointers enabled in the whole distribution (eg https://pagure.io/fesco/issue/3084).

There's a persistent myth that frame pointers have a huge overhead, because there was a single Python case that had a +10% slowdown (now fixed). The actual measured overhead is under 1%, which is far outweighed by the gains we've since been able to make in certain applications.

  • I believe it's a misrepresentation to say that "actual measured overhead is under 1%". I don't think such a claim can be universally applied because this depends on the very workload you're measuring the overhead with.

    FWIW your results don't quite match the measurements from Linux kernel folks who claim that the overhead is anywhere between 5-10%. Source: https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy...

       I didn't preserve the data involved but in a variety of workloads including netperf, page allocator microbenchmark, pgbench and sqlite, enabling framepointer introduced overhead of around the 5-10% mark.
    

    The significance of their results, IMO, lies in the fact that they measured the impact using PostgreSQL and SQLite. If anything, DBMSs are one of the best ways to really stress the system.

    • Those are numbers from 7 years ago, so they're beginning to get a bit stale as people put more weight behind having frame pointers and make upstream contributions to their compilers to improve the generated code. Much more recent testing puts it at <1%: by the very R.W.M. Jones you're replying to [0] and, separately, by others like Brendan Gregg [1b], whose post this thread is commenting on (his article includes [1b] in the Appendix as well), with similar accounts from others in the last couple of years. Oh, and if you use flamegraph, you might want to check the repo for a familiar name.

      Some programs, like Python, have reported worse, 2-7% [2], but there is traction on tackling that [1a] (see both rwmj's and brendangregg's replies to sibling comments, they've both done a lot of upstreamed work wrt. frame pointers, performance, and profiling).

      As has been frequently pointed out, the benefits from improved profiling cannot be overstated; even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles. Plus, you can always disable it in specific hotspots later when needed, which is much easier than the reverse.

      Something, something, premature optimisation -- though in seriousness, this information benefits actual optimisation. We lack the data that would support truly universal claims precisely because things like this haven't been available, and so haven't been widely used. We know frame pointers, with their additional register pressure and extended function prologue/epilogue, can be a detriment in certain hotspots; that's why we have granular control. But without them, we often don't know which hotspots are actually affected, so I'm sure even the databases would benefit. That said, the "my database is the fastest database" problem has always been the result of endless micro-benchmarking rather than actual end-to-end program performance and latency, so even a claimed "10%" drop there probably doesn't impact real-world usage. That's also why some of the most interesting profiling work lately has come from ideas like causal profilers and continuous profilers, which answer exactly that.

      [0]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar... [1a]: https://pagure.io/fesco/issue/2817#comment-826636 [1b]: https://pagure.io/fesco/issue/2817#comment-826805 [2]: https://discuss.python.org/t/the-performance-of-python-with-...

      11 replies →

  • You probably already know, but with OCaml 5 the only way to get flamegraphs working is to either:

    * use framepointers [1]

    * use LBR (but LBR has a limited depth, and may not work on all CPUs, I'm assuming due to bugs in perf)

    * implement some deep changes in how perf works to handle the 2 stacks in OCaml (I don't even know if this would be possible), or write/adapt some eBPF code to do it

    OCaml 5 has a separate stack for OCaml code and C code, and although GDB can link them based on DWARF info, perf DWARF call-graphs cannot (https://github.com/ocaml/ocaml/issues/12563#issuecomment-193...)

    If you need more evidence to keep it enabled in future releases, you can use OCaml 5 as an example (unfortunately there aren't many OCaml applications, so that may not carry too much weight on its own).

    [1]: I hadn't actually realised that Fedora 39 has already enabled FP by default, nice! (I still do most of my day-to-day profiling on a ~CentOS 7 system with 'perf record --call-graph dwarf -F 47 -a'. I was aware that there was a discussion about enabling FP by default, but hadn't noticed it had actually been done already.)

  • Frame pointers are still a no-go on 32-bit, i.e. anything that is IoT today.

    The reason we removed them was not a myth but comes from the pre-64 bit days. Not that long ago actually.

    Even today, if you want to repurpose older 64-bit systems with a new life, this kind of optimization still makes sense.

    Ideally it should be the default also for security critical systems because not everything needs to be optimized for "observability"

    • > Frame pointers are still a no-go on 32bit so anything that is IoT today.

      Isn't that just 32-bit x86, which isn't used in IoT? The other 32-bit ISAs aren't register-starved like x86.

      4 replies →

  • Thanks; what was the Python fix?

  • I'm still baffled by this attitude. That "under 1%" overhead is why computers are measurably slower to use than 30 years ago. All those "under 1%" overheads add up.

That's one thing Apple did do right on ARM:

> The frame pointer register (x29) must always address a valid frame record. Some functions — such as leaf functions or tail calls — may opt not to create an entry in this list. As a result, stack traces are always meaningful, even without debug information.

https://developer.apple.com/documentation/xcode/writing-arm6...
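
To make it concrete, a frame-record chain like that can be walked with a few lines of C. This is just a sketch of the idea (mine, not Apple's code); it assumes the conventional layout where the frame pointer register points at a {previous frame pointer, return address} pair (as on AArch64, or on x86-64 built with frame pointers) and that the chain ends in a zeroed frame pointer at the outermost frame:

    #include <stdio.h>

    /* Assumed frame record layout: the frame pointer register (x29 on
       AArch64, rbp on x86-64 with frame pointers) points at a pair of
       { caller's frame pointer, return address }. */
    struct frame_record {
        struct frame_record *prev;
        void                *return_address;
    };

    __attribute__((noinline))
    void print_backtrace(void)
    {
        /* GCC/Clang builtin: the address of the current frame record. */
        struct frame_record *fp = __builtin_frame_address(0);
        for (int depth = 0; fp != NULL && depth < 64; depth++) {
            printf("%p\n", fp->return_address);
            fp = fp->prev;   /* assumes the chain ends at a NULL sentinel */
        }
    }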

  • On Apple platforms, there is often an interpretability problem of another kind: Because of the prevalence of deeply nested blocks / closures, backtraces for Objective C / Swift apps are often spread across numerous threads. I don't know of a good solution for that yet.

    • I'm not very familiar with Objective C and Swift, so this might not make sense. But JS used to have a similar problem with async/await. The v8 engine solved it by walking the chain of JS promises to recover the "logical stack" developers are interested in [1].

      [1] https://v8.dev/blog/fast-async

      1 reply →

I was at Google in 2005 on the other side of the argument. My view back then was simple:

Even if $BIG_COMPANY makes a decision to compile everything with frame pointers, the rest of the community will not. So we'll be stuck fighting an unwinnable argument with a much larger community. Turns out that it was a ~20 year argument.

I ended up writing some patches to make libunwind work for gperftools and maintained libunwind for some number of years as a consequence of that work.

Having moved on to other areas of computing, I'm now a passive observer. But it's fascinating to read history from the other perspective.

  • > So we'll be stuck fighting an unwinnable argument with a much larger community.

    In what way would you be stuck? What functional problems does adding frame pointers introduce?

  • [flagged]

    • The clear and obvious win would have been adoption of a universal userspace generic unwind facility, like Windows has --- one that works with multiple languages. Turning on frame pointers is throwing in the towel on the performance tooling ecosystem coordination problem: we can't get people to fix unwind information, so we do this instead? Ugh.

      12 replies →

    • I think this came off somewhat aggressive. I vouched for the comment because flagging it is an absurd overreaction, but I also don't think pointing out isolated individuals would be of much help.

      Barriers to progress here are best identified on a community level, wouldn't you say?

      But people, please calm down. Filing an issue or posting to the mailing list to make a case isn't sending a SWAT team to people's home. It's a technical issue, one well within the envelope of topics which can be resolved politely and on the merits.

Of course, if you cede RBP to be a frame pointer, you may as well have two stacks, one which is pointed into by RBP and stores the activation frames, and the other one which is pointed into by RSP and stores the return addresses only. At this point, you don't even need to "walk the stack" because the call stack is literally just a flat array of return addresses.

Why do we normally store the return addresses near to the local variables in the first place, again? There are so many downsides.

  • It simplifies storage management. A stack frame is a simple bump of a pointer that is always in cache, with only one guard page needed for overflow; in your proposal you need two guard pages, double the stack manipulations, and double the chance of a cache miss.

    • Yes, two guard pages are needed. No, the stack management stays the same: it's just "CALL func" at the call site, "SUB RBP, <frame_size>" at the prologue and "ADD RBP, <frame_size>; RET" at the epilogue. As for the chances of a cache miss... probably, but I guess you also double them when you enable CET / shadow stacks, so eh.

      In exchange, it becomes very difficult for the stack smashing to corrupt the return address.

  • Note the ‘shadow stacks’ CPU feature mentioned briefly in the article, though it’s more for security reasons. It’s pretty similar to what you describe.

    • Shadow stacks have been proposed as an alternative, although it's my understanding that in current CPUs they hold only a limited number of frames, like 16 or 32?

      1 reply →

  • While we're here, why do we grow the stack the wrong way, so that misbehaved programs cause security issues? I know the reason of course; like so many things, it last made sense 30 years ago, but the effects have been interesting.

  • You may be ready for Forth [1] ;-). Strangely, the Wikipedia article apparently doesn't mention that Forth allows access to both the parameter stack and the return stack, which is a major feature of the model.

    [1] https://en.wikipedia.org/wiki/Forth_(programming_language)

    • That does seem like a significant oversight. >r and r>, and cousins, are part of ANSI Forth, and I've never used a Forth which doesn't have them.

  • >Why do we normally store the return addresses near to the local variables in the first place, again? There are so many downsides.

    The advantage of storing them elsewhere is not quite clear (unless you have hardware support for things like shadow stacks).

    You'd have to argue that the cost of moving things to this other page and managing two pointers (where one is less powerful in the ISA) is meaningfully cheaper than the other equally effective mitigation of stack cookies/protectors which are already able to provide protection only where needed. There is no real security benefit to doing this over what we currently have with stack protectors since an arbitrary read/write will still lead to a CFI bypass.

    • > The advantage of storing them elsewhere is not quite clear (unless you have hardware support for things like shadow stacks).

      The classic buffer overflow issue should spring immediately to mind. By having a separate return address stack it's far less vulnerable to corruption through overflowing your data structures. This stops a bunch of attacks which purposely put crafted return addresses into position that will jump the program to malicious code.

      It's not a panacea, but generally keeping code pointers away from data structures is a good idea.

Virgil doesn't use frame pointers. If you don't have dynamic stack allocation, the frame of a given function has a fixed size and can be found with a simple (binary-search) table lookup. Virgil's technique uses an additional page-indexed range that further restricts the lookup to a few comparisons on average (O(log(# retpoints per page))). It combines the unwind info with stackmaps for GC. It takes very little space.

The main driver is in https://github.com/titzer/virgil/blob/master/rt/native/Nativ... ; the rest of the code in the directory implements the decoding of the metadata.

I think frame pointers only make sense if frames are dynamically-sized (i.e. have stack allocation of data). Otherwise it seems weird to me that a dynamic mechanism is used when a static mechanism would suffice; mostly because no one agreed on an ABI for the metadata encoding, or an unwind routine.
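
To illustrate the kind of static mechanism I mean, here is a rough sketch of the general idea (not Virgil's actual runtime code; the table layout and the assumption that the return address sits just above a fixed-size frame are simplifications):

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per return point: where the call returns to, and how large
       the calling function's frame is at that point. Sorted by retpoint. */
    struct unwind_entry {
        uintptr_t retpoint;
        size_t    frame_size;
    };

    /* Binary search for the entry whose retpoint is the largest one <= pc. */
    static const struct unwind_entry *
    lookup(const struct unwind_entry *tab, size_t n, uintptr_t pc)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (tab[mid].retpoint <= pc) lo = mid + 1; else hi = mid;
        }
        return lo ? &tab[lo - 1] : NULL;
    }

    /* Unwind one level: with fixed-size frames, the caller's return address
       sits at a known offset from the stack pointer, so no frame pointer is
       needed. Returns 0 when pc is not covered by the table. */
    static uintptr_t unwind_step(uintptr_t *sp, uintptr_t pc,
                                 const struct unwind_entry *tab, size_t n)
    {
        const struct unwind_entry *e = lookup(tab, n, pc);
        if (e == NULL)
            return 0;
        uintptr_t ret = *(const uintptr_t *)(*sp + e->frame_size);
        *sp += e->frame_size + sizeof(uintptr_t);  /* pop frame + return addr */
        return ret;
    }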

I believe the 1-2% measurement number. That's in the same ballpark as pervasive array bounds checks. It's weird that the odd debugging and profiling task gets special pleading for a 1% cost but adding a layer of security gets the finger. Very bizarre priorities.

  • You can add bounds checks to C, but that costs a hell of a lot more than 1-2%. C++ has them off by default for std::vector because C++ is designed by and for the utterly insane. Other than that, I can't off the top of my head think of a language that doesn't have them.

    • The bounds-safety C compiler extension research by Apple has measured the runtime impact of adding bounds checking to C, and it is not a lot more than 1-2% in almost all cases. Even in microbenchmarks it's often around 5%. The impact on media encoding and decoding was around 1-2%, and the overall power use on the device did not change.

      https://www.youtube.com/watch?v=RK9bfrsMdAM https://llvm.org/devmtg/2023-05/slides/TechnicalTalks-May11/...

      It's a myth that bounds checking has extraordinary performance costs and cannot be enabled without slowing everything to a halt. Maybe this was the case 10 years ago or 20 years ago or something, but not today.

    • > C++ has them off by default for std::vector because c++ is designed by and for the utterly insane.

      And for those who value performance and don't want to pay the cost of "a lot more than 1-2%" ;p

      4 replies →

Good post!

> Profiling has been broken for 20 years and we've only now just fixed it.

It was a shame when they went away. Lots of people, certainly on other systems and probably Linux too, have found the absence of frame pointers painful this whole time and tried to keep them available in as many environments as possible. It’s validating (if also kind of frustrating) to see mainstream Linux bring them back.

  • I’m sincerely curious. While I realize that using DWARF for unwinding is annoying, why is it so bad that it's worth pessimizing all code on the system? It's slow on Debian derivatives, for example, because they package only the slow unwinding path for perf, for licensing reasons, but with decent tooling I barely notice the difference. What am I missing?

    • Using DWARF is annoying with perf because the kernel doesn't support stack unwinding with DWARF, and never will. The kernel has to be the one which unwinds the user space stacks, because it is the one managing the perf counters and handling the interrupts.

      Since it can't unwind the user space stacks, the kernel has to copy the entire stack (8192 bytes by default) into the output file (perf.data) and then the perf user space program will unwind it later. It does this for each sample, which is usually hundreds of times per second, per CPU. Though it depends how you configured the collection.

      That does have a significant overhead: first, runtime overhead: copying 8k bytes, hundreds of times per second, and writing it to disk, all don't come for free. You spend quite a bit of CPU time doing the memcpy operation which consumes memory bandwidth too. You also frequently need to increase the size of the perf memory buffer to accommodate all this data while it waits for user space to write it to disk. Second, disk space overhead, since the 8k stack bytes per sample are far larger than the stack trace would be. And third, it does require that you install debuginfo packages to get the DWARF info, which is usually a pain to do on production machines, and they consume a lot of disk space on their own.
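
      Roughly, what perf asks the kernel for in this mode looks like the sketch below; the event, sample rate and register mask are illustrative rather than perf's exact configuration:

          #include <linux/perf_event.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <sys/types.h>
          #include <unistd.h>

          /* Open a sampling event that copies a chunk of the user stack into
             each sample, to be DWARF-unwound later in user space. */
          static int open_dwarf_style_sampling(pid_t pid, int cpu)
          {
              struct perf_event_attr attr;
              memset(&attr, 0, sizeof(attr));
              attr.size = sizeof(attr);
              attr.type = PERF_TYPE_HARDWARE;
              attr.config = PERF_COUNT_HW_CPU_CYCLES;
              attr.freq = 1;
              attr.sample_freq = 99;            /* samples per second */
              attr.sample_type = PERF_SAMPLE_IP
                               | PERF_SAMPLE_TID
                               | PERF_SAMPLE_REGS_USER
                               | PERF_SAMPLE_STACK_USER;
              attr.sample_stack_user = 8192;    /* bytes of stack copied per sample */
              attr.sample_regs_user = (1ULL << 6) | (1ULL << 7) | (1ULL << 8);
              /* ^ bp, sp and ip on x86-64, so user space can start unwinding */
              return (int)syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
          }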

      Many of these overheads aren't too bad in simple cases (lower sample rates, fewer CPUs, or targeting a single task). But with larger machines with hundreds of CPUs, full system collections, and higher frequencies, the overhead balloons.

      I'm not certain I know what you mean by the "slow unwinding path for perf", as there is no faster path for user space when frame pointers are disabled (except Intel LBR as outlined in the blog).

    • I assume you're talking about the “nondistro build”?

      The difference between -g (--call-graph fp) and --call-graph dwarf is large even with perf linked directly against binutils, at least in my experience (and it was much, much worse until I got some patches into binutils to make it faster). This is both on record and report.

      There are also weird bugs with --call-graph dwarf that perf upstream isn't doing anything about, around inlining. It's not perfect by any means.

Overall, I am for frame pointers, but after some years working in this space, I thought I would share some thoughts:

* Many frame pointer unwinders don't account for a problem they have that DWARF unwind info doesn't have: frame set-up is not atomic. It's done in two instructions, `push %rbp` and `mov %rsp, %rbp`, and if a snapshot is taken while we're at the `push`, we'll miss the parent frame. I think this might be fixable by inspecting the code, but that would only be as good as a heuristic, as there can be other `push %rbp` instructions unrelated to frame set-up (a rough sketch of one such heuristic is at the end of this comment). I would love to hear if there's a better approach!

* I developed the solution Brendan mentions, which allows faster, in-kernel unwinding without frame pointers using BPF [0]. This doesn't use the DWARF CFI (the unwind info) as-is but converts it into a random-access format that we can use in BPF. He mentions it not supporting JVM languages, and while it's true that right now it only supports JIT sections that have frame pointers, I had planned to implement a full JVM interpreter unwinder. I have since left Polar Signals and shifted priorities, but it's feasible to get a JVM unwinder to work in lockstep with the native unwinder.

* In an ideal world, enabling frame pointers should be decided case by case. Benchmarking is key, and the tradeoffs you make might change a lot depending on the industry you are in and what your software is doing. In the past I have seen large projects enable/disable frame pointers without doing an in-depth assessment of the losses/gains in performance and observability, and how they connect to business metrics. The Fedora folks have done a superb and rigorous job here.

* Related to the previous point, having a build system that lets you change this system-wide, including the libraries your software depends on, is great not only for testing these changes but also for putting them in production.

* Lastly, I am quite excited about SFrame that Indu is working on. It's going to solve a lot of the problems we are facing right now while letting users decide whether they use frame pointers. I can't wait for it, but I am afraid it might take several years until all the infrastructure is in place and everybody upgrades to it.

- [0]: https://web.archive.org/web/20231222054207/https://www.polar...
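
Regarding the first point, a rough sketch of the kind of heuristic I have in mind (x86-64; `is_function_entry` is a hypothetical helper that would need symbol or disassembly data, and a sample landing between the `push` and the `mov` would still need extra handling):

    #include <stdbool.h>
    #include <stdint.h>

    struct sampled_regs { uint64_t ip, sp, bp; };

    /* Hypothetical helper: does ip point at the first instruction of a
       function (i.e. the frame-pointer push has not executed yet)? */
    bool is_function_entry(uint64_t ip);

    /* Recover the sampled function's return address (i.e. the caller's pc). */
    static uint64_t caller_return_address(const struct sampled_regs *r)
    {
        if (is_function_entry(r->ip)) {
            /* Prologue not run yet: the return address is still at the top
               of the stack and rbp still belongs to the caller. */
            return *(const uint64_t *)r->sp;
        }
        /* Normal case: rbp points at { saved rbp, return address }. */
        return *((const uint64_t *)r->bp + 1);
    }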

  • On the third point, you have to enable frame pointers across the whole Linux distro in order to get good flamegraphs. You have to do whole-system analysis to really understand what's going on. The way current binary Linux distros (like Fedora and Debian) work makes any alternative impossible.

  • It could be one instruction: ENTER N,0 (where N is the amount of stack space to reserve for locals)---this is the same as:

        PUSH EBP
        MOV  EBP,ESP
        SUB  ESP,N
    

    (I don't recall if ENTER is available on x86-64 or not.) But even with this, the frame setup isn't atomic with respect to CALL, and if the snapshot is taken after the CALL but before the ENTER, we still don't get the frame setup.

    As for the reason why ENTER isn't used: it was deemed too slow. LEAVE (MOV SP,BP; POP BP) is used as it's just as fast as, if not faster than, the sequence it replaces. If ENTER were just the PUSH/MOV/SUB sequence, it probably would be used, but it's that other operand (which is 0 in my example above) that kills it performance-wise (it's for nested functions to gain access to enclosing stack frames and is very expensive to use).

  • Great comments, thanks for sharing. The non-atomic frame setup is indeed problematic for CPU profilers, but it's not an issue for allocation profiling, off-CPU profiling or other types of non-interrupt-driven profiling. But as you mentioned, there might be ways to solve that problem.

That's very interesting to me - I had seen the `[unknown]` mountain in my profiles but never knew why. I think it's a tough thing to justify: 2% performance is actually a pretty big difference.

It would be really nice to have fine-grained control over frame pointer inclusion: given fine-grained profiling, we could determine whether we need frame pointers for a given function or compilation unit. I wouldn't be surprised if we see that only a handful of operations are dramatically slowed by frame pointer inclusion while the rest don't really care.

  • > 2% performance is actually a pretty big difference.

    No it's not, particularly when it can help you identify hotspots via profiling that can net you improvements of 10% or more.

    • Sure, but how many of the people running distro compiled code do perf analysis? And how many of the people who need to do perf analysis are unable to use a with-frame-pointers version when they need to? And how many of those 10% perf improvements are in common distro code that get upstreamed to improve general user experience, as opposed to being in private application code?

      If you're Netflix then "enable frame pointers" is a no-brainer. But if you're a distro that's building code for millions of users, many of whom will likely never need to fire up a profiler, I think the question is at least a little trickier. The overall best tradeoff might still end up being to enable frame pointers, but I can see the other side too.

      6 replies →

  • You can turn it on/off per function by attaching one of these GCC attributes to the function declaration (although this doesn't work with LLVM):

      __attribute__((optimize("no-omit-frame-pointer")))
      __attribute__((optimize("omit-frame-pointer")))

  • The performance cost in your case may be much smaller than 2 per cent.

    Don't completely trust the benchmarks on this; they are a bit synthetic and real-world applications tend to produce very different results.

    Plus, profiling is important. I was able to speed up various segments of my code by up to 20 per cent by profiling them carefully.

    And, at the end of the day, if your application is so sensitive about any loss of performance, you can simply profile your code in your lab using frame pointers, then omit them in the version released to your customers.

    • > And, at the end of the day, if your application is so sensitive about any loss of performance, you can simply profile your code in your lab using frame pointers, then omit them in the version released to your customers.

      That is what should be done, but TFA is about distros shipping code with frame pointers to end users because some developers are too lazy to recompile libc when profiling. Somehow shipping two copies of libc, one intended for end users on low-powered devices and one intended for developers, is not even considered.

    • If you can't introspect the release version of your software, you have no way of determining what the issue is. You're doing pseudo-science and guesswork to try to replicate the issue on a development version of the software. And if you put a few new logging statements into the release version, there's a pretty good chance that simply restarting the software will cause the symptom to go away.

  • The measured overhead is slightly less than 1%. There have been some rare historical cases where frame pointers have caused performance to blow up but those are fixed.

JIT'ed code is sadly poorly supported, but LLVM has had great hooks for noting each method that is produced and its address. So you can build a simple mixed-mode unwinder pretty easily, but mostly in-process.

I think Intel's DNN things dump their info out to some common file that perf can read instead, but because the *kernels* themselves reuse rbp throughout oneDNN, it's totally useless.

Finally, can any JVM folks explain this claim about DWARF info from the article:

> Doesn't exist for JIT'd runtimes like the Java JVM

that just sounds surprising to me. Is it off by default or literally not available? (Google searches have mostly pointed to people wanting to include the JNI/C side of a JVM stack, like https://github.com/async-profiler/async-profiler/issues/215).

Just as a general comment on this topic...

The fact that people complain about the performance of the mechanism that enables the system to be profiled, and so performance problems be identified, is beyond ironic. Surely the epitome of premature optimisation.

  • It's not moronic when the people profiling and the people having performance issues are different groups. That some developer benefits from having frame pointers in libc does not mean that all users of that software also need to have frame pointers enabled.

  • It goes both ways. I see people with ultra-bloated applications trying to add even more bloat and force it on the rest of us too. Like the guy saying that DWARF unwinding is impractical when you have 1Gbyte code cache.

    I don't see the people writing hyperfast low-level code asking for this.

    My experience with profiling is anyway that you don't need that fine-grained profiling. Its main use is finding stuff like "we spend 90% of our time reallocating a string over and over to add characters one at a time". After a few of those it's just "it's a little bit slow everywhere".

    • > After a few of those it's just "it's a little bit slow everywhere".

      And after that, you need fine-grained profiling to find multiple 1% wins and apply them repeatedly.

      (I do this for a living, basically)

      1 reply →

  • So what are these other techniques the 2004 migration away from frame pointers assumed would work for stack walking? Why don't they work today? I get that x86_64 has a lot more registers, so there's minimal value in freeing up one more?

    • In 2004, the assumption made by the GCC developers was that you would be walking stacks very infrequently, in a debugger like GDB. Not sampling stacks 1000s of times a second for profiling.

  • im sure in ancient mesopotamia there was somebody arguing that you could brew beer faster if you stopped measuring the hops so carefully, but then someone else was saying yes, but if you dont measure the hops carefully then you dont know the efficiency of your overall beer making process so you cant isolate the bottlenecks.

    the funny thing is i am not sure if the world would actually work properly if we didn't have both of these kinds of people.

This doesn't detract from the content at all but the register counts are off; SI and DI count as GPRs on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).

  • > [...] on i686 bringing it to 6+BP (not 4+BP) meanwhile x86_64 has 14+BP (not 16+BP).

    That is, on i686 you have 7 GPRs without frame pointers, while on x86_64 you have 14 GPRs even with frame pointers.

    Copying a comment of mine from an older related discussion (https://news.ycombinator.com/item?id=38632848):

    "To emphasize this point: on 64-bit x86 with frame pointers, you have twice as many registers as on 32-bit x86 without frame pointers, and these registers are twice as wide. A 64-bit value (more common than you'd expect even when pointers are 32 bits) takes two registers on 32-bit x86, but only a single register on 64-bit x86."

As much as the return of frame pointers is a good thing, it's largely unnecessary -- it arrives at a point where multiple eBPF-based profilers are available that do fine using .eh_frame and also manually unwind high-level-language runtime stacks: both Parca from Polar Signals and the artist formerly known as Prodfiler (now Elastic Universal Profiling) do fine.

So this is a solution for a problem, and it arrives just at the moment that people have solved the problem more generically ;)

(Prodfiler coauthor here, we had solved all of this by the time we launched in Summer 2021)

  • First of all, I think the .eh_frame unwinding y'all pioneered is great.

    But I think you're only thinking about CPU profiling at <= 100 Hz / core. However, Brendan's article is also talking about Off-CPU profiling, and as far as I can tell, all known techniques (scheduler tracing, wall clock sampling) require stack unwinding to occur 1-3 orders of magnitude more often than for CPU profiling.

    For those use cases, I don't think .eh_frame unwinding will be good enough, at least not for continuous profiling. E.g. see [1][2] for an example of how frame pointer unwinding allowed the Go runtime to lower execution tracing overhead from 10-20% to 1-2%, even though it was already using a relatively fast lookup-table approach.

    [1] https://go.dev/blog/execution-traces-2024

    [2] https://blog.felixge.de/reducing-gos-execution-tracer-overhe...

  • I'm under the impression that eh_frame stack traces are much slower than frame pointer stack traces, which makes always-on profiling, such as seen in tcmalloc, impractical.

  • Also I've heard that the whole .eh_frame unwinding is more fragile than a simple frame pointer. I've seen enough broken stack traces myself, but honestly I never checked whether -fno-omit-frame-pointer would have helped.

    • Yes and no. A simple frame pointer needs to be present in all libraries, and depending on build settings, this might not be the case. .eh_frame tends to be emitted almost everywhere...

      So it's both similarly fragile, but one is almost never disabled.

      The broader point is: For HLL runtimes you need to be able to switch between native and interpreted unwinds anyhow, so you'll always do some amount of lifting in eBPF land.

      And yes, having frame pointers removes a lot of complexity, so it's a net very good thing. It's just that the situation wasn't nearly as dire as described, because people who care about profiling had built solutions.

      3 replies →

  • You mean we don't need accessible profiling in free software because there are companies selling it to us. Cool.

    • Parca is open-source, Prodfiler's eBPF code is GPL, and the rest of Prodfiler is currently going through OTel donation, so my point is: There's now multiple FOSS implementations of a more generic and powerful technique.

  • If you're sufficiently in control of your deployment details to ensure that BPF is available at all. CAP_SYS_PTRACE is available ~everywhere for everyone.

I thought we'd been using /Oy (Frame-Pointer Omission) for years on Windows, and that there was a pdata section on x64 that was used for stack walking; however, to my great surprise, I just read on MSDN that "In x64 compilers, /Oy and /Oy- are not available."

Does this mean Microsoft decided they weren't going to support breaking profilers and debuggers OR is there some magic in the pdata section that makes it work even if you omit the frame-pointer?

All of this information is static; there's no need to sacrifice a whole CPU register only to store data that's already known. A simple lookup data structure that maps an instruction address range to the stack offset of the return address should be enough to recover the stack layout. On Windows you'd precompute that from PDB files, and I'm sure you can do the same thing with whatever the equivalent debug data structure is on Linux.

Are his books (the ones about Systems Performance and eBPF) relevant for normal software engineers who want to improve performance in normal services? I don't work for a FAANG, and our usual performance issues are solved by adding indexes here and there, caching, and simple code analysis. Tools like Datadog already help a lot.

  • Profiling is a pretty basic technique that is applicable to all software engineering. I'm not sure what a "normal" service is here, but I think we all have an obligation to understand what's happening in the systems we own.

    Some people may believe that 100ms latency is acceptable for a CLI tool, but what if it could be 3ms? On some aesthetic level, it also feels good to be able to eliminate excess. Finally, you should learn it because you won't necessarily have that job forever.

  • Diving into flame graphs being worthwhile for optimization assumes that your workload is CPU-bound. Most business software does not have such workloads, and rather (as you yourself have noted) spends most of its time waiting for I/O (database, network, filesystem, etc).

    And so, (as you again have noted), your best bet is to just use plain old logging and tracing (like what datadog provides) to find out where the waiting is happening.

Nix (and I assume Guix) is very convenient for this, as it is fairly easy to turn frame pointers on or off for parts or the whole of the system.

I am not sure, but I believe -fomit-frame-pointer in x86-64 allows the compiler to use a _thirteenth_ register, not a _seventeenth_ .

There's another option: https://lesenechal.fr/en/linux/unwinding-the-stack-the-hard-...

I started programming in 1979, and I can't believe I've managed to avoid learning about stack frames and all those EBP register tricks until now. I always had parameters to functions in registers, not on the stack, for the most part. The compiler hid a lot of things from me.

Is it because I avoided Linux and C most of my life? Perhaps it's because I used debug, and Periscope before that... and never gdb?

so what is the downside to using e.g. dwarf-based stack walking (supported by perf) for libc, which was the original stated problem?

in the discussion the issue gets conflated with jit-ted languages, but that has nothing to do with the crusade to enable frame pointers for system libraries.

and if you care that much for dwarf overhead... just cache the unwind information in your system-level profiler? no need to rebuild everything.

  • The way perf does it is slow, as the entire stack is copied into user-space and is then asynchronously unwound.

    This is solvable, as Brendan calls out. We've created an eBPF-based profiler at Polar Signals that essentially does what you said: it optimizes the unwind tables, caches them in BPF maps, and then unwinds synchronously instead of copying the whole stack into user space.

    • It should also be said that you need some sort of DWARF-like information to understand inlining. If I have a function A that inlines B that in turn inlines C, I'd often like to understand that C takes a bunch of time, and with frame pointers only, that information gets lost.

      1 reply →

    • This conveniently sidesteps the whole issue of getting DWARF data in the first place, which is also still a broken disjointed mess on Linux. Hell, Windows solved this many many years ago.

      1 reply →

  • The article explains why DWARF is not an option.

    • Extremely light on the details, and also conflates it with the JIT which makes it harder to understand the point, so I was wondering about the same thing as well.

Have compilers (or I guess x86?) gotten better at dealing with the frame pointer? Or are we just saying that taking a significant perf hit is acceptable if it lets you find other tiny perf problems? Because I recall -fomit-frame-pointer being a significant performance win, bigger than most of the things that you need a perfect profiler to spot.

Brendan is such a treasure to the community (buy his book it’s great).

I wasn’t doing extreme performance stuff when -fomit-frame-pointer became the norm, so maybe it was a big win for enough people to be a sane default, but even that seems dubious: “just works” profiling is how you figure out when you’re in an extreme performance scenario (if you’re an SG14 WG type, you know it and are used to all the defaults being wrong for you).

I’m deeply grateful for all the legends who have worked on libunwind, gperf stuff, perftool, DTrace, eBPF: these are the too-often-unsung heroes of software that is still fast after decades of Moore’s law free-riding.

But they’ve been fighting an uphill battle against a weird alliance of people trying to game compiler benchmarks and the really irresponsible posture that “developer time is more expensive”, which is only sometimes true and never true if you care about people on low-spec gear, who are already the least-resourced part of the global community.

I’m fortunate enough to have a fairly modern desktop, laptop, and phone: for me it’s merely annoying that chat applications and music players and windowing systems offer nothing new except enshittification in terms of features while needing 10-100x the resources they did a decade ago.

But for half of my career and 2/3rds of my time coding, I was on low-spec gear most of the time, and I would have been largely excluded if people didn’t care a lot about old computers back then.

I’m trying to help a couple of aspiring hackers get started right now, and it’s a real struggle to get their environments set up with limitations like Intel Macs and WSL2 as the Linux option (WSL2 is very cool, but it’s not loved enough by e.g. yarn projects).

If you want new hackers, you need to make things work well on older computers.

Thanks again Brendan et al!

I disagree with this sentence of the article:

"I could say that times have changed and now the original 2004 reasons for omitting frame pointers are no longer valid in 2024."

The original 2004 reason for omitting frame pointers is still valid in 2024: it's still a big performance win on the register-starved 32-bit x86 architecture. What has changed is that the 32-bit x86 architecture is much less relevant nowadays (other than legacy software, for most people it's only used for a small instant while starting up the firmware), and other common 32-bit architectures (like embedded 32-bit ARM) are not as register-starved as the 32-bit x86.

GCC optimization causes the frame pointer push to move around, resulting in wrong call stacks. "Wontfix"

https://news.ycombinator.com/item?id=38896343

  • That was in 2012. Does it still occur on modern GCC?

    There definitely have been regressions with frame pointers being enabled, although we've fixed all the ones we've found in current (2024) Fedora.

    • I think so, and I vaguely recall -fno-schedule-insns2 being the only thing that fixes it. To get the full power of frame pointers and a hackable binary, what I use is:

          -fno-schedule-insns2
          -fno-omit-frame-pointer
          -fno-optimize-sibling-calls
          -mno-omit-leaf-frame-pointer
          -fpatchable-function-entry=18,16
          -fno-inline-functions-called-once
      

      The only flag that's potentially problematic is -fno-optimize-sibling-calls since it breaks the optimal approach to writing interpreters and slows down code that's written in a more mathematical style.

All this talk about frame pointers enabled on various Linux distros, and we still haven't talked about the biggest one, which would be Android. Have they decided to compile with frame pointers yet? The last time I looked into this, many years ago, some perf folks on the Android team dismissed it as a perf regression (and also ART was already using the FP register for its own purposes).

That's really interesting. I disabled frame pointer omission in my project because I read that it hurts debugging and code introspection and provided only minimal performance benefits. I had no idea it had caused so much pain over decades.

To this day I still believe that there should be a dedicated protected separate stack region for the call stack that only the CPU can write to/read from. Walking the stack then becomes trivially fast because you just need to do a very small memcpy. And stack memory overflows can never overwrite the return address.

  • This is a thing; it's called shadow call stack. Both ARM and now Intel have extensions for it.

    • But the shadow stack concept seems much dumber to me. Why write the return address to the regular stack and the shadow stack and then compare? Why not use only the shadow stack and not put return addresses on the main stack at all?

      4 replies →

It said GCC. I noted that LLVM has defaulted to keeping the frame pointer since 2011. Is this mainly a GCC issue?

I'm surprised that most of the comments are concerned about telemetry and observability per se rather than the potential of frame pointers for programmers at large. They can provide beneficial insight into hard-to-debug programming constructs, for example delegates (which C# and D [1] have but Rust doesn't) and CTFE (which D and Zig have but Rust doesn't). In the near future I can foresee seamless frame pointer integration with programming language constructs easing the debugging of these extremely useful higher-order function constructs like delegates. It also has the potential to enable a futuristic programming paradigm as proposed in Bret Victor's presentation [2].

[1] D Language Reference: Function Pointers, Delegates and Closures:

https://dlang.org/spec/function.html#function-pointers-deleg...

[2] Bret Victor The Future of Programming:

https://youtu.be/8pTEmbeENF4

-fomit-frame-pointer had serious performance benefits back in the day. I would always compile my own MySQL and PHP and save money on hardware.

But those days meant 32-bit processors with few registers.

Times change.

glibc is only 2 MB, so why does Chrome rely on the system glibc instead of statically linking its own version with frame pointers enabled?

Not interesting. ENTER/LEAVE also do the same thing as your save/restore of rbp.

Far more interesting: I recall there might be an instruction where rbp isn't allowed.

[flagged]

  • I’m not sure what use case you’re coming from but it sounds like you’re saying something like: most end users don’t use a profiler or debugger so why should they pay the cost of debuggability? That’s fine I guess if you’re throwing software over a wall to users and taking no responsibility for their experience. But a lot of people who build software do take some responsibility for bugs and performance problems that their users experience. This stuff is invaluable for them. End users benefit (tremendously) from software being debuggable even if the users themselves never run a profiler or debugger that uses the frame pointers (because the developers are able to find and fix problems reported by other users or the developers themselves).

  • > Let's make software more inefficient (even if it's tiny, it all adds up!)

    I'm not sure if you know who the author of that blog is, but if there's anyone in the world who cares about (and is knowledgeable about) improving performance, it's him. You can be pretty darn sure he wouldn't do this if he believed it would make software more inefficient.

  • We have an embarrassment of riches in terms of compute power, slowing down everything by a negligible amount is worth it if it makes profiling and observability even 20% easier. In almost all cases you cannot just walk up to a production Linux system and do meaningful performance analysis without serious work, unlike say a Solaris production system from 15 years ago.

      We have an embarrassment of riches in terms of compute power, yet all software is still incredibly slow because we end up slowing down everything by a "negligible" amount here and another "negligible" amount there and so on.

      > In almost all cases you cannot just walk up to a production Linux system and do meaningful performance analysis without serious work

      So? That's one very very very specific use case that others should not have to pay for. Not even with a relatively small 1% perf hit.

      3 replies →

  • > Let's make software more inefficient

    Isn't the whole point of enabling frame pointers everywhere to make it easier to profile software so that it can be made more efficient?

  • You seem to be stuck in the '90s. Computing is 64-bit nowadays, not 32-bit, and modern architectures/ABIs integrate frame pointers and frame records in a way that’s both natural and performant.

  • As someone who worked on V8 for some years, I can assure that Web bloat is not due to frame pointers.

    • No, but it is exactly the same attitude: so-called "tiny" performance hits for developer convenience.