← Back to context

Comment by loeg

10 months ago

This also seems like a big objection:

> The overhead to walk DWARF is also too high, as it was designed for non-realtime use.

Not a problem in practice. The way you solve it is to just translate DWARF into a simpler representation that doesn't require you to walk anything. (But I understand why people don't want to do it. DWARF is insanely complex and annoying to deal with.)

Source: I wrote multiple profilers.

  • For a busy 64-CPU production JVM, I tested Google's Java symbol logging agent that just logged timestamp, symbol, address, size. The c2 compiler was so busy, constantly, that the overhead of this was too high to be practical (beyond startup analysis). And all this was generating was a timestamp log to do symbol lookup. For DWARF to walk stacks there's a lot more steps, so while I could see it work for light workloads I doubt it's practical for the heavy production workloads I typically analyze. What do you think? Have you tested on a large production server where c2 is a measurable portion of CPU constantly, the code cache is >1Gbyte and under heavy load?

    • I regularly profile heavy time-sensitive (as in: if the code takes too long to run it breaks) workloads, and I even do non-sampling memory profiling (meaning: on every memory allocation and deallocation I grab a full backtrace, which is orders of magnitude more data than normal sampling profiling) and it works just fine with minimal slowdown even though I get the unwinding info from vanilla DWARF.

      Granted, this is using optimized tooling which uses a bunch of tricks to side-step the problem of DWARF being slow, I only profile native code (and some VMs which do ahead-of-time codegen) and I've never worked with JVM, but in principle I don't see why it wouldn't be practical on JVM too, although it certainly would be harder and might require better tooling (which might not exist currently). If you have the luxury of enabling frame pointers then that certainly would be easier and simpler.

      (Somewhat related, but I really wish we would standardize on something better than DWARF for unwinding tables and basic debug info. Having done a lot of work with DWARF and its complexity I wouldn't wish it upon my worst enemy.)

  • In this thread[1] we're discussing problems with using DWARF directly for unwinding, not possible translations of the metadata into other formats (like ORC or whatever).

    [1]: https://news.ycombinator.com/item?id=39732010

    • I wasn't talking about other formats. I was talking about preloading the information contained in DWARF into a more efficient in-memory representation once when your profiler starts, and then the problem of "the overhead is too high for realtime use" disappears.

From https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf

    DWARF-based unwinding can be a bottleneck for time-sensitive program analysis tools. For instance the perf profiler is forced to copy the whole stack on taking each sample and to build the backtraces offline: this solution has a memory and time overhead but also serious confidentiality and security flaws.

So if I get this correctly, the problem with DWARF is that building the backtrace online (on each sample) in comparison to frame pointers is an expensive operation which, however, can be mitigated by building the backtrace offline at the expense of copying the stack.

However, paper also mentions

    Similarly, the Linux kernel by default relies on a frame pointer to provide reliable backtraces. This incurs in a space and time overhead; for instance it has been reported (https://lwn.net/Articles/727553/) that the kernel’s .text size increases by about 3.2%, resulting in a broad kernel-wide slowdown.

and

    Measurements have shown a slowdown of 5-10% for some workloads (https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy@suse.de/T/#u).

But that one has at least some potential mitigation. Per his analysis, the Java/JIT case is the only one that has no mitigation:

> Javier Honduvilla Coto (Polar Signals) did some interesting work using an eBPF walker to reduce the overhead, but...Java.