Comment by javierhonduco

10 months ago

Overall, I am for frame pointers, but after some years working in this space, I thought I would share some thoughts:

* Many frame pointer unwinders don't account for a problem that DWARF unwind info doesn't have: the frame set-up is not atomic. It's done in two instructions, `push %rbp` and `mov %rsp, %rbp`, and if a snapshot is taken before both have executed (at the `push`, or between the `push` and the `mov`), we'll miss the parent frame. I think this might be fixable by inspecting the code, but it might only be as good as a heuristic, as there could be other `push %rbp` instructions unrelated to the stack frame. I would love to hear if there's a better approach!

* I developed the solution Brendan mentions which allows faster, in-kernel unwinding without frame pointers using BPF [0]. This doesn't use DWARF CFI (the unwind info) as-is but converts it into a random-access format that we can use in BPF. He mentions not supporting JVM languages, and while it's true that right now it only supports JIT sections that have frame pointers, I planned to implement a full JVM interpreter unwinder. I have left Polar Signals since and shifted priorities but it's feasible to get a JVM unwinder to work in lockstep with the native unwinder.

* In an ideal world, enabling frame pointers should be done on a case-by-case basis. Benchmarking is key, and the tradeoffs that you make might change a lot depending on the industry you are in and what your software is doing. In the past I have seen large projects enabling/disabling frame pointers without doing an in-depth assessment of the losses/gains in performance and observability, and how they connect to business metrics. The Fedora folks have done a superb and rigorous job here.

* Related to the previous point, having a build system that lets you change this system-wide, including in the libraries your software depends on, can be awesome not only to test these changes but also to put them in production.

* Lastly, I am quite excited about SFrame that Indu is working on. It's going to solve a lot of the problems we are facing right now while letting users decide whether they use frame pointers. I can't wait for it, but I am afraid it might take several years until all the infrastructure is in place and everybody upgrades to it.
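To make the first point concrete, here is a minimal Python sketch of the "inspect the code" heuristic. It is not taken from any real profiler: the function names, the simulated stack, and the addresses are all made up for illustration. It assumes x86-64, where `push %rbp` encodes as `0x55` and `mov %rsp, %rbp` as `48 89 e5` (or `48 8b ec`). It only covers the case where the sample lands on one of those two prologue instructions, and it can misfire when the same bytes appear somewhere that is not a frame set-up, which is exactly why it is only a heuristic.

```python
from typing import Callable, List, Optional

PUSH_RBP = b"\x55"                                # push %rbp
MOV_RSP_RBP = (b"\x48\x89\xe5", b"\x48\x8b\xec")  # both encodings of mov %rsp, %rbp


def prologue_return_offset(code: bytes) -> Optional[int]:
    """If `code` (bytes at the sampled IP) looks like a prologue instruction,
    return the offset from rsp at which the return address sits."""
    if code.startswith(PUSH_RBP):
        return 0   # nothing pushed yet: return address is at [rsp]
    if code.startswith(MOV_RSP_RBP):
        return 8   # caller's rbp already pushed: return address is at [rsp + 8]
    return None


def unwind(ip: int, rsp: int, rbp: int,
           read_u64: Callable[[int], int],
           read_code: Callable[[int, int], bytes],
           fix_prologue: bool = True, max_frames: int = 64) -> List[int]:
    """Walk the frame pointer chain, optionally recovering the parent frame
    when the sample was taken before the frame was fully set up."""
    frames = [ip]
    if fix_prologue:
        offset = prologue_return_offset(read_code(ip, 3))
        if offset is not None:
            frames.append(read_u64(rsp + offset))
    # Standard walk: [rbp] holds the previous rbp, [rbp + 8] the return address.
    # Assumes the outermost frame stores a saved rbp of 0.
    while rbp != 0 and len(frames) < max_frames:
        frames.append(read_u64(rbp + 8))
        rbp = read_u64(rbp)
    return frames


# Hypothetical snapshot: main -> foo -> bar, sampled at the start of bar.
BAR_START = 0x400300
BAR_CODE = b"\x55\x48\x89\xe5"  # push %rbp; mov %rsp, %rbp
STACK = {
    0x1000: 0,        0x1008: 0x400010,  # main's frame
    0x0FE0: 0x1000,   0x0FE8: 0x400120,  # foo's frame
    0x0FC8: 0x400230,                    # return address into foo, pushed by call
    0x0FC0: 0x0FE0,                      # foo's rbp, once bar's push has run
}


def read_u64(addr: int) -> int:
    return STACK[addr]


def read_code(addr: int, n: int) -> bytes:
    start = addr - BAR_START
    return BAR_CODE[start:start + n]
```

With this snapshot, calling `unwind(0x400300, 0x0FC8, 0x0FE0, read_u64, read_code, fix_prologue=False)` skips the return address into `foo`, while the default call recovers it.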

- [0]: https://web.archive.org/web/20231222054207/https://www.polar...
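As a rough illustration of what a "random-access format" for unwind info can look like, here is a small Python sketch. It is not the actual format used by the BPF unwinder in [0]; the row layout, field names, and addresses are assumptions made up for this example. The idea shown is only the general one: flatten the unwind rules into rows sorted by program counter so that a lookup is a binary search, then compute the previous frame from the matching row. On x86-64 the return address sits at CFA - 8.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class UnwindRow:
    """Hypothetical flattened unwind rule, valid from `pc` until the next row."""
    pc: int                    # first program counter covered by this row
    cfa_reg: str               # register the CFA is computed from: "rsp" or "rbp"
    cfa_offset: int            # CFA = value of cfa_reg + cfa_offset
    rbp_offset: Optional[int]  # caller's rbp saved at CFA + rbp_offset, or None if untouched


def lookup(rows: List[UnwindRow], pc: int) -> Optional[UnwindRow]:
    """Binary search for the last row whose pc is <= the sampled pc.
    `rows` must be sorted by pc."""
    index = bisect_right([row.pc for row in rows], pc) - 1
    return rows[index] if index >= 0 else None


def step(row: UnwindRow, regs: Dict[str, int],
         read_u64: Callable[[int], int]) -> Dict[str, int]:
    """Compute the caller's rip, rsp and rbp from one unwind row."""
    cfa = regs[row.cfa_reg] + row.cfa_offset
    if row.rbp_offset is None:
        caller_rbp = regs["rbp"]
    else:
        caller_rbp = read_u64(cfa + row.rbp_offset)
    return {"rip": read_u64(cfa - 8), "rsp": cfa, "rbp": caller_rbp}


# Hypothetical rows for a function that starts with push %rbp; mov %rsp, %rbp.
ROWS = [
    UnwindRow(pc=0x400300, cfa_reg="rsp", cfa_offset=8, rbp_offset=None),   # at the push
    UnwindRow(pc=0x400301, cfa_reg="rsp", cfa_offset=16, rbp_offset=-16),   # at the mov
    UnwindRow(pc=0x400304, cfa_reg="rbp", cfa_offset=16, rbp_offset=-16),   # function body
]
```

Because each row describes the stack layout at that exact instruction, a sample taken in the middle of the prologue still finds the right return address, which is the property frame pointer unwinders lack.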

On the third point, you have to do frame pointers across the whole Linux distro in order to be able to get good flamegraphs. You have to do whole system analysis to really understand what's going on. The way that current binary Linux distros (like Fedora and Debian) work makes any alternative impossible.

It could be one instruction: ENTER N,0 (where N is the amount of stack space to reserve for locals)---this is the same as:

    PUSH EBP
    MOV  EBP,ESP
    SUB  ESP,N

(I don't recall if ENTER is x86-64 or not). But even with this, the frame setup isn't atomic with respect to CALL, and if the snapshot is taken after the CALL but before the ENTER, we still don't get the frame setup.

As for the reason why ENTER isn't used, it was deemed too slow. LEAVE (MOV ESP,EBP; POP EBP) is used as it's just as fast as, if not faster than, the sequence it replaces. If ENTER were just the PUSH/MOV/SUB sequence, it probably would be used, but it's that other operand (which is 0 in my example above) that kills it performance-wise (it's for nested functions to gain access to upper stack frames and is very expensive to use).
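To show what that second operand does, here is a toy Python model of 32-bit ENTER, written from my reading of the instruction's documented pseudocode. The `Cpu` class and function names are invented for this sketch, and it models only the stack and register effects, not timing. With a nesting level of 0 it matches the PUSH/MOV/SUB sequence exactly; with a nonzero level it also copies frame pointers from the enclosing frame, which means extra memory reads and writes on every call.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cpu:
    """Toy 32-bit machine state: two registers and a sparse stack."""
    esp: int
    ebp: int
    mem: Dict[int, int] = field(default_factory=dict)
    stack_writes: int = 0  # counts pushes, as a crude proxy for work done

    def push(self, value: int) -> None:
        self.esp -= 4
        self.mem[self.esp] = value
        self.stack_writes += 1


def enter(cpu: Cpu, alloc_size: int, nesting_level: int = 0) -> None:
    """Model of ENTER alloc_size, nesting_level."""
    cpu.push(cpu.ebp)
    frame_temp = cpu.esp
    if nesting_level > 0:
        # Copy the enclosing frames' pointers (the "display") into the new frame.
        for _ in range(1, nesting_level):
            cpu.ebp -= 4
            cpu.push(cpu.mem[cpu.ebp])
        cpu.push(frame_temp)
    cpu.ebp = frame_temp
    cpu.esp -= alloc_size


def push_mov_sub(cpu: Cpu, alloc_size: int) -> None:
    """The three-instruction prologue that ENTER N,0 stands for."""
    cpu.push(cpu.ebp)      # PUSH EBP
    cpu.ebp = cpu.esp      # MOV  EBP,ESP
    cpu.esp -= alloc_size  # SUB  ESP,N
```

For example, `enter(cpu, 16, 0)` performs one stack write, while `enter(cpu, 16, 3)` performs four, and each of the copied entries also costs a read from the caller's frame.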

Great comments, thanks for sharing. The non-atomic frame setup is indeed problematic for CPU profilers, but it's not an issue for allocation profiling, off-CPU profiling or other types of non-interrupt-driven profiling. But as you mentioned, there might be ways to solve that problem.