
Comment by rwmj

10 months ago

I'm glad he mentioned Fedora because it's been a tiresome battle to keep frame pointers enabled in the whole distribution (eg https://pagure.io/fesco/issue/3084).

There's a persistent myth that frame pointers have a huge overhead, because there was a single Python case with a +10% slowdown (now fixed). The actual measured overhead is under 1%, which is far outweighed by the performance gains we've been able to make in certain applications.

I believe it's a misrepresentation to say that the "actual measured overhead is under 1%". I don't think such a claim can be applied universally, because it depends on the workload you're measuring the overhead with.

FWIW your results don't quite match the measurements from Linux kernel folks who claim that the overhead is anywhere between 5-10%. Source: https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy...

   I didn't preserve the data involved but in a variety of workloads including netperf, page allocator microbenchmark, pgbench and sqlite, enabling framepointer introduced overhead of around the 5-10% mark.

The significance of their results, IMO, is that they measured the impact using PostgreSQL and SQLite. If anything, DBMSs are among the best ways to really stress a system.

  • Those are numbers from 7 years ago, so they're beginning to get a bit stale as people put more weight behind having frame pointers and contribute upstream to their compilers to improve the generated code. Much more recent testing puts it at <1%: by the very R.W.M. Jones you're replying to [0], and separately by others like Brendan Gregg [1b], whose post this is commenting on (and which includes [1b] in its Appendix as well), with similar accounts by others in the last couple of years. Oh, and if you use flamegraph, you might want to check the repo for a familiar name.

    Some programs, like Python, have reported worse figures, 2-7% [2], but there is traction on tackling that [1a] (see both rwmj's and brendangregg's replies to sibling comments; they've both done a lot of upstreamed work w.r.t. frame pointers, performance, and profiling).

    As has been frequently pointed out, the benefits from improved profiling cannot be overstated; even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles. Plus, you can always disable them in specific hotspots later when needed, which is much easier than the reverse (see the sketch after the references below).

    Something, something, premature optimisation -- though in seriousness, this information is what enables actual optimisation. We lack the data and understanding that would allow truly universal claims precisely because tooling like this hasn't been widely available, and so hasn't been widely used. We know frame pointers, through additional register pressure and an extended function prologue/epilogue, can be a detriment in certain hotspots; that's why we have granular control. But without them we often don't know which hotspots are actually affected, so I'm sure even the databases would benefit... though the "my database is the fastest database" problem has always been the result of endless micro-benchmarking rather than actual end-to-end program performance and latency, so even a claimed "10%" drop there probably doesn't impact real-world usage. That's a reason why some of the most interesting profiling work lately has come from ideas like causal profilers and continuous profilers, which answer exactly that.

    [0]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar... [1a]: https://pagure.io/fesco/issue/2817#comment-826636 [1b]: https://pagure.io/fesco/issue/2817#comment-826805 [2]: https://discuss.python.org/t/the-performance-of-python-with-...
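    To make the "disable it in specific hotspots" point concrete, here's a minimal sketch, assuming GCC (the optimize attribute is GCC-specific and Clang handles this differently; hot_inner_loop and checksum are hypothetical names, not something from the thread):

        /* Build everything with frame pointers, e.g.
         *   gcc -O2 -fno-omit-frame-pointer -c app.c
         * then opt one measured hotspot back out of them.  Omitting the
         * frame pointer drops the push %rbp / mov %rsp,%rbp prologue and
         * frees %rbp for the loop, while the rest of the program keeps
         * its frames, so fp-based unwinding stays useful elsewhere. */
        #include <stddef.h>

        __attribute__((noinline, optimize("omit-frame-pointer")))
        static long hot_inner_loop(const long *v, size_t n)
        {
            long sum = 0;
            for (size_t i = 0; i < n; i++)
                sum += v[i] * v[i];
            return sum;
        }

        long checksum(const long *v, size_t n)
        {
            return hot_inner_loop(v, n);   /* this frame still unwinds */
        }

    The point is only that the opt-out is per function once you know where the cycles go; profiling a distro built without frame pointers has no equally cheap escape hatch.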

    • While improved profiling is useful, achieving it by wasting a register is annoying, because it is just a very dumb solution.

      The choice made by Intel when they designed the 8086 to use two separate registers for the stack pointer and the frame pointer was a big mistake.

      It is very easy to use a single register as both the stack pointer and the frame pointer, as is standard, for instance, on IBM POWER.

      Unfortunately, on Intel/AMD CPUs using a single register is difficult, because the simplest implementation is unreliable: an interrupt may occur between two instructions that must form an atomic sequence, and it can clobber the stack after the old frame pointer value has been written but before new stack space has been allocated.

      It would have been very easy to correct this in new CPUs by detecting that instruction sequence and blocking interrupts between the two instructions.

      Intel had already done this once early in the history of the x86 CPUs, when they discovered a mistake in the design of the ISA: interrupts could occur between updating the stack segment and updating the stack pointer. They corrected it by detecting that instruction sequence and blocking interrupts at the boundary between the two instructions.

      The same could have been done now, to enable the use of the stack pointer as the frame pointer as well. (This would be done by always saving the stack pointer at the top of the stack whenever stack space is allocated, so that the stack pointer always points to the previous frame pointer, i.e. to the start of the linked list containing all stack frames.)
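      To make the "linked list containing all stack frames" idea concrete, here is a minimal sketch in C of what an fp-style unwinder walks. It assumes the conventional x86-64 frame layout you get with -fno-omit-frame-pointer (saved frame pointer at the base of each frame, return address just above it); the POWER back chain reached through r1 forms the same kind of list:

          /* Walk the current thread's frame-pointer chain.
           * Compile with: gcc -O2 -fno-omit-frame-pointer walk.c */
          #include <stdio.h>

          struct frame {
              struct frame *prev;   /* saved frame pointer of the caller */
              void         *ret;    /* return address into the caller    */
          };

          __attribute__((noinline))
          static void dump_backtrace(void)
          {
              struct frame *fp = __builtin_frame_address(0);
              for (int depth = 0; fp && depth < 64; depth++) {
                  printf("#%d return address %p\n", depth, fp->ret);
                  if (fp->prev <= fp)   /* stack grows down; stop on nonsense */
                      break;
                  fp = fp->prev;
              }
          }

          __attribute__((noinline)) static void leaf(void)  { dump_backtrace(); }
          __attribute__((noinline)) static void inner(void) { leaf(); }

          int main(void) { inner(); return 0; }

      This is also why profilers like frame pointers: the whole call stack is captured by chasing one pointer per frame, with no DWARF tables and no copied stack snapshots.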

    • I'd prefer discussing the technical merits of a given approach rather than who is who and who did what, since that leads to an appeal-to-authority fallacy.

      You're correct, the results might be stale, although I wouldn't hold my breath, since as far as my understanding goes there has been no fundamental change in the way frame pointers are handled. Perhaps smaller improvements in compiler technology, but CPUs haven't undergone any significant change in that respect.

      That said, nowhere in this thread have we seen a dispute of those Linux kernel results other than categorically rejecting them as being "microbenchmarks", which they are not.

      > though the "my database is the fastest database" problem has always been the result of endless micro-benchmarking, rather than actual end-to-end program performance and latency

      Quite the opposite. All database benchmarks are end-to-end program performance and latency analysis. "Cheating" in database benchmarks is done elsewhere.

    • > As has been frequently pointed out, the benefits from improved profiling cannot be overstated; even a 10% cost to having frame pointers can be well worth it when you leverage that information to target the actual bottlenecks that are eating up your cycles.

      Few can leverage that information, because the open source software you are talking about lacks telemetry in the self-hosted case.

      The profiling issue really comes down to the cultural opposition in these communities to collecting telemetry and opening it up for anyone to see and use. The average user struggles to ally with a trustworthy actor who will share information like profiles freely and anonymize it at a per-user level, the level that is actually useful. Such things exist, like the Linux hardware site, but only because they have not attracted the attention of agitators.

      Basically users are okay with profiling, so long as it is quietly done by Amazon or Microsoft or Google, and not by the guy actually writing the code and giving it out for everyone to use for free. It’s one of the most moronic cultural trends, and blame can be put squarely on product growth grifters who equate telemetry with privacy violations; on open source maintainers, who have enough responsibilities as is, besides educating their users; and on Apple, who have made their essentially vaporous claims about privacy a central part of their brand.

      Of course people know the answer to your question. Why doesn’t Google publish every profile of every piece of open source software? What exactly is sensitive about their workloads? Meta publishes a whole library about every single one of its customers, for anyone to freely read. I don’t buy into the holiness of the backend developer’s “cleverness” or whatever is deemed sensitive, and it’s so hypocritical.


You probably already know, but with OCaml 5 the only way to get flamegraphs working is to either:

* use frame pointers [1]

* use LBR (but LBR has a limited depth, and may not work on all CPUs, I'm assuming due to bugs in perf)

* implement some deep changes in how perf works to handle the 2 stacks in OCaml (I don't even know if this would be possible), or write/adapt some eBPF code to do it

OCaml 5 has separate stacks for OCaml code and C code, and although GDB can link them based on DWARF info, perf's DWARF call-graphs cannot (https://github.com/ocaml/ocaml/issues/12563#issuecomment-193...)

If you need more evidence to keep it enabled in future releases, you can use OCaml 5 as an example (unfortunately there aren't many OCaml applications, so that may not carry too much weight on its own).

[1]: I hadn't actually realised that Fedora 39 has already enabled FP by default, nice! (I still do most of my day-to-day profiling on a ~CentOS 7 system with 'perf record --call-graph dwarf -F 47 -a'; I was aware that there was a discussion about enabling FP by default, but hadn't noticed it has actually been done already.)

Frame pointers are still a no-go on 32-bit, i.e. on anything that is IoT today.

The reason we removed them was not a myth; it comes from the pre-64-bit days. Not that long ago, actually.

Even today, if you want to give older 64-bit systems a new life, this kind of optimization still makes sense.

Ideally it should also be the default for security-critical systems, because not everything needs to be optimized for "observability".

  • > Frame pointers are still a no-go on 32-bit, i.e. on anything that is IoT today.

    Isn't that just 32-bit x86, which isn't used in IoT? The other 32-bit ISAs aren't register-starved like x86.

    • It would be, yes. x86 had very few registers, so anything you could do to free one up was vital. 32-bit Arm has 16 general-purpose registers, and RISC-V has 32, so they're nowhere near as register-starved as x86 was; going from 32-bit to 64-bit doesn't change that picture much. If anything, 64-bit frame pointers make it marginally worse.


Thanks; what was the Python fix?

I'm still baffled by this attitude. That "under 1%" overhead is why computers are measurably slower to use than 30 years ago. All those "under 1%" overheads add up.