Comment by ozgrakkurt

6 hours ago

The dataset they use is <14GB of parquet [1], so the "cold start" run seems to be intended, in a way, to also measure the case where the dataset doesn't fit in memory.

I don't think this is an oversight; it is just what they found to be feasible. This is explicitly written in [1]. Also, the guy who set up this benchmark is very serious about benchmarking under difficult conditions [2].

My personal opinion is that you need a massive amount of data and a massive number of different variables to test separately. For example, you might want to monitor how many cache misses/hits there were, p99 latency, etc. And you want to do it under full load, expected load, etc. And you want to compare different versions of the same database, because comparing different databases makes things combinatorially more difficult, unless you have a real production use case that you are optimizing for, of course.
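To make the p99 point concrete, here is a minimal sketch of summarizing latency samples for two versions of the same database. The sample numbers and the names `percentile`, `summarize`, `old_version`, and `new_version` are made up for illustration, and nearest-rank is only one of several common percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the median, the tail, and the worst case side by side."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }


# Hypothetical per-query latencies in milliseconds (not real measurements).
old_version = [12, 13, 12, 14, 13, 12, 15, 13, 12, 250]
new_version = [11, 12, 11, 12, 11, 12, 11, 12, 11, 400]

print("old:", summarize(old_version))
print("new:", summarize(new_version))
```

In this invented data the new version has the better median but the worse tail, which is the kind of thing a single average would hide.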

The SwissTable talk at CppCon is a good example of a useful benchmark and optimization. It shows how difficult it is to really assess the performance effects of even "small" changes [3].

[1] https://github.com/ClickHouse/ClickBench#data-loading

[2] https://www.youtube.com/watch?v=CAS2otEoerM

[3] https://www.youtube.com/watch?v=ncHmEUmJZf4

Yeah, the tl;dr is that benchmarking is freaking hard because what you actually care about is "does my workload, today and in the future, run better or worse given the current setup?" But it is hard to identify what your workload actually is, what systems you are going to be allowed to run it on, and what tweaks would even be possible if you knew the internals of a system and how it aligns with your hardware. And it all comes with the price tag of "and if you do anything different tomorrow with any of these variables it might not hold."

  • Yeah, but also, I want to know the p50 warm performance, not just the p99. Run the same query twice in a row after a cold start. And then another 10 times. Then run a different set of queries, and at the end of the day, or the week, you still have no real idea how the system will perform in prod for your particular use case.

    Benchmarking is hard, no argument from me!

    • Yep, I actually want to know the system has some sort of baseline performance that only hockey sticks under conditions I can monitor and control... but also the business wants to try new feature X, the vendor is promising new performance for feature Y, and new patches are coming in affecting ???.
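For the "run it once cold, then again warm, then another 10 times" idea above, a rough sketch of a timing harness might look like the following. The names `time_call`, `cold_and_warm`, and `fake_query` are hypothetical, and `fake_query` only simulates a cache with a sleep. A real cold start would need the database restarted and the OS page cache dropped, which this sketch does not do.

```python
import statistics
import time


def time_call(fn):
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def cold_and_warm(run_query, warm_runs=10):
    """Time the first (cold) run separately from the following warm runs."""
    cold = time_call(run_query)
    warm = [time_call(run_query) for _ in range(warm_runs)]
    return {
        "cold_s": cold,
        "warm_p50_s": statistics.median(warm),
        "warm_max_s": max(warm),
        "warm_samples": len(warm),
    }


# Stand-in for a real query: the first call "loads data" (simulated with a
# sleep), later calls hit an in-process cache. Purely illustrative.
_cache = {}


def fake_query():
    if "rows" not in _cache:
        time.sleep(0.05)  # pretend to read from disk
        _cache["rows"] = list(range(1000))
    return sum(_cache["rows"])


result = cold_and_warm(fake_query, warm_runs=10)
print(result)
```

Even with numbers like these in hand, they only describe this query on this machine today, which is the point the thread keeps coming back to.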