Comment by spockz

1 day ago

Thank you for this interface! It will definitely help in tracking down GC related performance issues or in selecting optimal settings.

One thing I still struggle with is seeing how much of a penalty our application threads suffer from other work, say GC. In the blog you mention that GC impacts the application not only through the CPU work of traversing and moving (old/live) objects, but also through the cost of thread pauses and other barriers.

How can we detect these? Is there a way we can share the data in some way like with OpenTelemetry?

Currently I do it by running a load on an application and retaining its memory resources until the point where its CPU usage skyrockets because of the strongly increasing GC cycles, and then comparing the CPU utilisation and the ratio of CPU used to work done.

Edit: it would be interesting to have the GC time spent added to a span. Even though that time is shared across multiple units of work, at least you can use it as a datapoint that the work was (significantly?) delayed by the GC occurring, or waiting for the required memory to be freed.
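
A minimal sketch of that idea, assuming the long-standing `GarbageCollectorMXBean.getCollectionTime()` counters are an acceptable proxy: it records how much JVM-wide GC time accumulated while one unit of work ran. The class and method names (`GcTimeDuringWork`, `gcMillisDuring`) are hypothetical, and the value is elapsed collection time for the whole JVM, not CPU time attributable to this particular piece of work.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: JVM-wide GC time that accumulated while a unit of work ran. */
final class GcTimeDuringWork {

    /** Sums the accumulated collection time (ms) over all collector beans. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 means the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the work and returns the GC time (ms) observed between start and end. */
    static long gcMillisDuring(Runnable work) {
        long before = totalGcTimeMillis();
        work.run();
        return Math.max(0, totalGcTimeMillis() - before);
    }

    public static void main(String[] args) {
        long gcMs = gcMillisDuring(() -> {
            // Placeholder workload: short-lived allocations to provoke some GC activity.
            List<byte[]> junk = new ArrayList<>();
            for (int i = 0; i < 200_000; i++) {
                junk.add(new byte[1024]);
                if (junk.size() > 1_000) {
                    junk.clear();
                }
            }
        });
        System.out.println("GC time observed during work: " + gcMs + " ms");
    }
}
```

The resulting number could be attached to a span as an attribute. Because it is shared across everything running at the same time, it only indicates that a GC happened during the work, not how much this work was delayed by it.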


jonasn 1 day ago

Thanks for reading! Your current method, pushing the load until the GC spirals and then comparing the CPU utilization, is exactly the painful, trial-and-error approach I'm hoping this new API helps alleviate.

You've hit on the exact next frontier of GC observability. The API in JDK 26 tracks the explicit GC cost (the work done by the actual GC threads). Tracking the implicit costs, like the overhead of ZGC's load barriers or G1's write barriers executing directly inside your application threads, along with the cache eviction penalties, is essentially the holy grail of GC telemetry.

I have spent a lot of time thinking about how to isolate those costs as part of my research. The challenge is that instrumenting those barrier events in a production VM without destroying application throughput (and creating observer effects) is incredibly difficult. It is absolutely an area of future research I am actively thinking about, but there isn't a silver bullet for it in standard HotSpot just yet.

Regarding thread pauses, one thing you could look at is time to safepoint; there is some support for analyzing that.

Regarding OpenTelemetry: MemoryMXBean.getTotalGcCpuTime() is exposed via the standard Java Management API, so OpenTelemetry should be able to hook into it.
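
As a rough sketch of how that value might be read and turned into a ratio: the example below looks up `getTotalGcCpuTime()` reflectively so that it still compiles and runs on JDKs that do not have the method. The class name `GcCpuShare` and its helpers are hypothetical, and it assumes that both the GC CPU time and `com.sun.management.OperatingSystemMXBean.getProcessCpuTime()` are reported in nanoseconds with a negative value meaning "unsupported".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.reflect.Method;
import java.util.OptionalLong;

/** Hypothetical helper: share of process CPU time spent in GC threads. */
final class GcCpuShare {

    /** Reads MemoryMXBean.getTotalGcCpuTime() if this JDK provides it. */
    static OptionalLong totalGcCpuTimeNanos() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        try {
            Method m = MemoryMXBean.class.getMethod("getTotalGcCpuTime");
            Object value = m.invoke(memory);
            if (value instanceof Long && (Long) value >= 0) {
                return OptionalLong.of((Long) value);
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Method missing on this JDK or not supported: fall through.
        }
        return OptionalLong.empty();
    }

    /** Reads the process CPU time where the JVM exposes it. */
    static OptionalLong processCpuTimeNanos() {
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            long t = ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime();
            if (t >= 0) {
                return OptionalLong.of(t);
            }
        }
        return OptionalLong.empty();
    }

    /** Fraction of process CPU time spent on GC; 0.0 if no CPU time is recorded. */
    static double share(long gcNanos, long processNanos) {
        return processNanos <= 0 ? 0.0 : (double) gcNanos / processNanos;
    }

    public static void main(String[] args) {
        OptionalLong gc = totalGcCpuTimeNanos();
        OptionalLong cpu = processCpuTimeNanos();
        if (gc.isPresent() && cpu.isPresent()) {
            System.out.printf("GC share of process CPU: %.2f%%%n",
                    100.0 * share(gc.getAsLong(), cpu.getAsLong()));
        } else {
            System.out.println("GC CPU time is not available on this JVM");
        }
    }
}
```

A metrics exporter could sample the same two values periodically and publish them as monotonically increasing counters, leaving the ratio to the monitoring backend.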

spockz 11 hours ago
After writing my previous post I was wondering: do we actually need to instrument the barrier events and other code tied to a GC? Currently we benchmark our application with different GCs at different settings and resource constraints, and then we pick one sizing and settings combination that we like (read: the most work per total CPU that still fits within the allocation constraints of our clusters). What ultimately matters for production is how the app behaves in production.
This will not help directly when developing new GCs or new versions of a GC. On the other hand, if we can have a noop GC, including omitting any of the barriers etc. required for a GC to function, we can create a baseline for apps, provided we have enough total memory to run the benchmark in.
Edit: I guess we can then also use perf to compare cache misses between runs with different GC implementations and settings. Not sure how this works out in real life, as it will depend heavily on the CPU, the kernel, and other load.
- yvdriess 3 hours ago
  
  The problem is that there is no baseline for measuring GC overhead. You cannot turn it off, you can only replace and compare with different strategies. For example sbrk is technically a noop GC, but that also has overhead and impact because it will not compact objects and give you bad cache behavior. (It illustrates the OP's point that it is not enough to measure pauses, sbrk has no pauses but gets outperformed easily.)
  You could stop collecting performance counters around GC phases, but even if you are not measuring, the CPU still runs through its instructions, causing second-order effects. And, as you mentioned, too-short-to-measure barriers and other bookkeeping overheads (updating ref counters etc.), or simply the fact that some tag bits or object slots are reserved, all impact performance.
  There is a good write-up of the problem and a way to estimate the cost based on different GC strategies, as you suggested, here: https://arxiv.org/abs/2112.07880
  The way I found to measure a no-GC baseline is to compare runs in an accurate workload performance simulator. Mark all GC and allocator related code regions and have the simulator skip all those instructions. Critically, that needs to be a simulator that does not deal with the functional simulation, but gets its instructions from a functional simulator, emulator or PIN tool that does execute everything. It's laborious, not very fast and impractical for production work. But it's the only way I found to answer a question like "What is the absolute overhead of memory management in Python?". (Answer: lower bound walltime sits around +25% avg, heavily depending on the pyperformance benchmark)