Comment by yvdriess
20 hours ago
The problem is that there is no baseline for measuring GC overhead. You cannot turn it off, you can only replace and compare with different strategies. For example sbrk is technically a noop GC, but that also has overhead and impact because it will not compact objects and give you bad cache behavior. (It illustrates the OP's point that it is not enough to measure pauses, sbrk has no pauses but gets outperformed easily.)
You could stop collecting performance counters around GC phases, but you even if you are not measuring the CPU still runs through its instructions, causing the second order effects. And as you mentioned too-short-to-measure barriers and other bookkeeping overheads (updating ref counters etc) or simply the fact that some tag bits or object slots are reserved all impact performance.
There is a good write-up of the problem and a way to estimate the cost based on different GC strategies, as you suggested, here: https://arxiv.org/abs/2112.07880
The way I found to measure a no-GC baseline is to compare them in an accurate workload performance simulator. Mark all GC and allocator related code regions and have the simulator skip all those instructions. Critically that needs to be a simulator that does not deal with the functional simulation, but gets it's instructions from a functional simulator, emulator or PIN tool that does execute everything. It's laborious, not very fast and impractical for production work. But, it's the only way I found to answer a question like "What is the absolite overhead of memory management in Python?". (Answer: lower bound walltime sits around +25% avg, heavily depending on the pyperformance benchmark)
No comments yet
Contribute on Hacker News ↗