Comment by dap
10 months ago
Good post!
> Profiling has been broken for 20 years and we've only now just fixed it.
It was a shame when they went away. Lots of people, certainly on other systems and probably Linux too, have found the absence of frame pointers painful this whole time and tried to keep them available in as many environments as possible. It’s validating (if also kind of frustrating) to see mainstream Linux bring them back.
I’m sincerely curious. While I realize that using DWARF for unwinding is annoying, why is it so bad that it’s worth pessimizing all code on the system? It’s slow on Debian derivatives, for example, because for license reasons they package only the slow unwinding path for perf, but with decent tooling I barely notice the difference. What am I missing?
Using DWARF is annoying with perf because the kernel doesn't support stack unwinding with DWARF, and never will. The kernel has to be the one which unwinds the user space stacks, because it is the one managing the perf counters and handling the interrupts.
Since it can't unwind the user space stacks, the kernel has to copy the entire stack (8192 bytes by default) into the output file (perf.data) and then the perf user space program will unwind it later. It does this for each sample, which is usually hundreds of times per second, per CPU. Though it depends how you configured the collection.
That has significant overhead. First, runtime overhead: copying 8 KiB hundreds of times per second and writing it to disk doesn't come for free. You spend quite a bit of CPU time on the memcpy, which also consumes memory bandwidth, and you frequently need to increase the size of the perf memory buffer to hold all this data while it waits for user space to write it to disk. Second, disk space overhead: the 8 KiB of stack bytes per sample is far larger than the stack trace itself would be. Third, you need to install debuginfo packages to get the DWARF info, which is usually a pain on production machines, and they consume a lot of disk space on their own.
Many of these overheads aren't too bad in simple cases (lower sample rates, fewer CPUs, or targeting a single task). But on larger machines with hundreds of CPUs, full-system collections, and higher sampling frequencies, the overhead adds up fast, since the copied data grows with the sample rate multiplied by the number of CPUs.
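To put rough numbers on the stack-copy cost, here is a back-of-the-envelope sketch. It assumes every CPU is busy and yields one sample per tick, ignores per-sample headers, and uses the 8192-byte default mentioned above. The frame depth (32), the 8 bytes per return address, and the three example configurations are my own illustrative assumptions, not measurements.

```python
# Rough estimate of how much data perf has to move per second in each mode.
# All inputs other than the 8192-byte default stack copy are assumptions.

def stack_copy_bytes_per_second(sample_hz, cpus, stack_bytes=8192):
    """Raw user stack copied per second when unwinding is deferred (DWARF mode)."""
    return sample_hz * cpus * stack_bytes


def fp_trace_bytes_per_second(sample_hz, cpus, frames=32, bytes_per_frame=8):
    """Return addresses recorded per second when the kernel walks frame pointers.

    frames=32 is a hypothetical average stack depth; 8 bytes assumes 64-bit addresses.
    """
    return sample_hz * cpus * frames * bytes_per_frame


if __name__ == "__main__":
    # Hypothetical configurations: (samples per second per CPU, number of CPUs)
    for hz, cpus in [(99, 4), (999, 64), (4999, 256)]:
        dwarf = stack_copy_bytes_per_second(hz, cpus)
        fp = fp_trace_bytes_per_second(hz, cpus)
        print(f"{hz:>5} Hz x {cpus:>3} CPUs: "
              f"stack copy {dwarf / 2**20:10.1f} MiB/s, "
              f"fp trace {fp / 2**20:8.1f} MiB/s")
```

Under these assumptions the stack copy is always 32 times larger than the frame-pointer trace (8192 versus 256 bytes per sample). The ratio is constant, but the absolute volume is what hurts once the sample rate and the CPU count both go up.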
I'm not certain I know what you mean by the "slow unwinding path for perf", as there is no faster path for user space when frame pointers are disabled (except Intel LBR as outlined in the blog).
I assume you're talking about the “nondistro build”?
The difference between -g (--call-graph fp) and --call-graph dwarf is large even with perf linked directly against binutils, at least in my experience (and it was much, much worse until I got some patches into binutils to make it faster). That holds for both perf record and perf report.
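If you want to compare the two modes yourself, a small helper like the following builds matching `perf record` command lines. This is only a sketch: the helper name, the output file names, the 99 Hz rate, and the `sleep 10` workload are placeholders I picked, and it only prints the commands rather than running them.

```python
import shlex


def perf_record_cmd(mode, output, freq_hz=99, workload=("sleep", "10")):
    """Build a system-wide `perf record` command line for one call-graph mode.

    mode is "fp" (kernel walks frame pointers) or "dwarf" (kernel copies the
    user stack and perf unwinds it later). Everything else is a placeholder.
    """
    if mode not in ("fp", "dwarf"):
        raise ValueError(f"unsupported call-graph mode: {mode}")
    return [
        "perf", "record",
        "-a",                    # system-wide collection
        "-F", str(freq_hz),      # sampling frequency in Hz
        "--call-graph", mode,
        "-o", output,            # write to a separate file per mode
        "--", *workload,
    ]


if __name__ == "__main__":
    for mode in ("fp", "dwarf"):
        print(shlex.join(perf_record_cmd(mode, f"perf-{mode}.data")))
```

Running both printed commands on the same workload and then comparing the size of the two output files, plus the time `perf report -i <file>` takes on each, should give a feel for the gap on your own machine.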
There are also weird bugs with --call-graph dwarf that perf upstream isn't doing anything about, around inlining. It's not perfect by any means.
Which is the slow unwinding path? The one from libbfd?