Comment by loeg

1 day ago

This seems like more of a philosophical argument than a practical one.

No, it is a very practical one and I'm actually surprised that you don't see it that way. Benchmarking is hard, and if you don't understand the basics then you can easily measure nonsense.

  • You raise a fair point about the percentiles. Those are reported as point estimates without confidence intervals, and the implied precision overstates what the system clock can deliver.

    The mean does get proper statistical treatment (t-distribution confidence interval), but you're right that JMH doesn't compute confidence intervals for percentiles. Reporting p0.00 with three significant figures is ... optimistic.

    That said, I think the core finding survives this critique. The improvement shows up consistently across ~11 million samples at every percentile from p0.50 through p0.999.

    • Yes, I would expect the 'order of magnitude' value to be relatively close but the absolute values to be very imprecise.

    • You can compute confidence intervals all you want (a rough bootstrap sketch for percentile intervals is at the end of this comment), but if you can't be sure, one way or another, that what you're observing (measuring) in your experiment is what you actually wanted to measure (the signal), no confidence interval will help you distinguish the signal from the noise.

      That said, at your CPU's base frequency, 80ns is ~344 cycles and 70ns is ~300 cycles, so the difference is ~40 cycles. That's on the order of ~2 CPU pipeline flushes due to branch mispredictions. Another example is RDTSCP, which, at least on Intel CPUs, forces all prior instructions to retire before it executes and prevents speculative execution of the following instructions until its result is available; that alone can impose a 10-30 cycle penalty. Both of these can interfere with measurements at the scale you're working with, so there is a possibility that you're measuring these effects instead of the optimization you thought you implemented (a crude way to size the timer-call overhead itself is sketched at the end of this comment).

      I am not saying that this is the case, I am just saying it's possible. Since the test is simple enough, I would eliminate other similar CPU-level gotchas that can screw up your hypothesis testing. In more complex scenarios I would have to consider them as well.

      The only reliable way I've found to be sure about what is really happening is to read the codegen (with JMH, the perfasm sketch at the end of this comment is one way to do that). And I do that _before_ each test run, or to be more precise after each recompile, because compilers do crazy transformations to our code, even when we just move an innocent-looking function a few lines up or add some naive boolean flag. If I don't do that, I could again end up measuring, observing, and finally drawing the conclusion that I implemented a speedup, without realizing that in that last case the compiler decided to eliminate half of the code because of that innocuous boolean flag. Just an example.

      The radix tree lookup looks interesting, and it would be interesting to see which exact instruction it idles on. I had a case where a function would sit idle, reproducibly, but when you looked into the function there was nothing obvious to optimize. It turned out that the CPU pipeline was so saturated that there were no execution ports left for the instruction this function was stalling on. The fix was to rewrite code elsewhere, but in the vicinity of this function. This is something flamegraphs can never show you, which is partly the reason I've never been a huge fan of them.
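
A rough sketch of the bootstrap idea mentioned above, for anyone who does want an interval around a percentile rather than a point estimate. Everything here is illustrative: `loadSamplesSomehow()` is a placeholder for however you export the raw per-invocation times from the run, the counts are arbitrary, and none of it is JMH API.

```java
import java.util.Arrays;
import java.util.Random;

// Rough sketch: bootstrap a 95% interval for the p99.9 of raw benchmark
// samples. loadSamplesSomehow() stands in for however you export the
// per-invocation times (in ns); nothing here is JMH API.
public class PercentileBootstrap {

    // Nearest-rank percentile on an already sorted array.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    public static void main(String[] args) {
        double[] samples = loadSamplesSomehow();
        Random rnd = new Random(42);
        int resamples = 200;
        double[] estimates = new double[resamples];
        double[] buf = new double[samples.length];
        for (int r = 0; r < resamples; r++) {
            for (int i = 0; i < buf.length; i++) {
                buf[i] = samples[rnd.nextInt(samples.length)]; // resample with replacement
            }
            Arrays.sort(buf);
            estimates[r] = percentile(buf, 0.999);
        }
        Arrays.sort(estimates);
        // 95% bootstrap interval = 2.5th and 97.5th percentiles of the estimates.
        System.out.printf("p99.9 ~ [%.1f, %.1f] ns%n",
                percentile(estimates, 0.025), percentile(estimates, 0.975));
    }

    static double[] loadSamplesSomehow() {
        // Stand-in data so the sketch runs; replace with the real samples.
        return new Random(1).doubles(100_000, 60.0, 120.0).toArray();
    }
}
```

With ~11 million samples the resampling gets slow, so in practice you would sub-sample first or use an order-statistics interval instead, but the idea is the same. And to be clear, this only quantifies sampling noise; it does nothing about the "are you measuring the right thing" problem above.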
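
On the "maybe you're partly measuring the measurement" point, here is a crude way to get a feel for how expensive the clock read itself is relative to those ~40 cycles. The iteration count and the 4.3 GHz used for the cycle conversion are assumptions for illustration, and the estimate is of course subject to the same caveats as any other microbenchmark.

```java
// Crude sketch: estimate the average cost of System.nanoTime() by timing a
// tight loop of back-to-back calls. The 4.3 GHz figure is an assumption used
// only to convert nanoseconds into an approximate cycle count.
public class TimerOverhead {
    public static void main(String[] args) {
        final int iters = 10_000_000;
        long sink = 0;                        // keep the calls from being optimized away
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            sink += System.nanoTime();
        }
        long end = System.nanoTime();
        double nsPerCall = (end - start) / (double) iters;
        double assumedGhz = 4.3;              // assumption: ~4.3 GHz base clock
        System.out.printf("~%.1f ns (~%.0f cycles) per nanoTime() call (sink=%d)%n",
                nsPerCall, nsPerCall * assumedGhz, sink);
    }
}
```

If that number comes out anywhere near the effect you're trying to measure, the measurement itself is part of your signal.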
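
And on reading the codegen before each run: with JMH specifically, one way to make that routine is the perfasm profiler, which dumps the hottest assembly regions for every run so they can be diffed after each recompile. A minimal sketch, assuming Linux perf and the hsdis disassembler are installed; `MyBenchmark` is a placeholder for the actual benchmark class. The same thing works from the command line with `-prof perfasm`, or more manually with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly`.

```java
import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Minimal sketch: attach JMH's perfasm profiler so each run prints the hot
// assembly, which can then be compared across recompiles. "MyBenchmark" is a
// placeholder; perfasm needs Linux perf plus the hsdis disassembler.
public class RunWithPerfAsm {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include("MyBenchmark")                  // placeholder benchmark regex
                .addProfiler(LinuxPerfAsmProfiler.class)
                .build();
        new Runner(opts).run();
    }
}
```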