Comment by sweetjuly
5 days ago
I wonder how much of the cost comes from the cache misses versus the more frequent indirections and the resulting ILP drop?
For example, what does this test look like if you don't randomize the chunks but instead keep them in work order? If you still see the perf hit, that suggests the cost is not from the cache misses but rather from the overhead of switching chunks more often.
That's roughly what the "repeated" scenario (about the middle of the post) measures. It's not in work order, but it is the same order every time, so caches work. And there you see that the working set size matters.
Note that the base setup has zero cache reuse because each run touches a completely different, cold part of memory. (That makes the result more of an upper bound on the needed chunk size.)
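The experiment proposed above could be sketched roughly like this (a minimal illustration, not the post's actual benchmark: the chunk size, chunk count, and access pattern are invented for demonstration, and Python timing is far too coarse to resolve real cache effects, so a serious measurement would use C or Rust with the post's actual workload):

```python
import random
import time

CHUNK_SIZE = 4096   # bytes per chunk (assumed, illustrative)
NUM_CHUNKS = 2048   # assumed, illustrative

# One contiguous buffer divided into logical chunks.
data = bytearray(CHUNK_SIZE * NUM_CHUNKS)

def process(order):
    """Touch every chunk in the given order; return a checksum."""
    total = 0
    for i in order:
        base = i * CHUNK_SIZE
        total += data[base] + data[base + CHUNK_SIZE - 1]
    return total

work_order = list(range(NUM_CHUNKS))
random_order = work_order[:]
random.shuffle(random_order)

# Same amount of work and the same number of chunk switches in
# both runs; only the chunk order differs. The timing delta
# therefore isolates cache behavior from the fixed per-chunk
# switching overhead.
for name, order in [("work order", work_order),
                    ("random order", random_order)]:
    start = time.perf_counter()
    checksum = process(order)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed * 1e3:.2f} ms, checksum={checksum}")
```

If the two orders time about the same, the cost is dominated by per-chunk overhead rather than cache misses, which is exactly the distinction the comment asks about.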