← Back to context

Comment by cb321

11 hours ago

You seem to misunderstand @andersa's point which I think is well expressed - it doesn't matter if the indices are randomized if the CPU can pre-fetch what they will be. The power of CPU speculative execution to hide latency can be quite surprising the first time you see it.

This is a very small Nim program to demonstrate for "show me the code" and "it must just not be 'random enough'!" skeptics: https://github.com/c-blake/bu/blob/main/memlat.nim It uses the exact dependency idea @andersa mentions of a random cycle of `x[i] = i` that others else-sub-thread say some CPUs these days are smart enough to "see through". On Intel CPUs I have, the dependency makes things 12x slower at the gigabyte scale (DIMMs).

EDIT: This same effect makes many a naive hash table microbenchmark { e.g., `for key in keys: lookup(key)` } unrepresentative of performance in real programs where each `key` is often not speculatively pre-computable.

In the end it depends exactly what you want to measure. Of course a load-load dependency will make everything as slow as the latency of the cache level you are accessing as that becomes the bottleneck.

Traversing a contiguous list of pointers in L1 is also slower than accessing those pointers by generating their address sequentially, so adding a load-load dependency is not a good way to benchmarking random access vs sequential access (it is a good way to benchmark vector traversal vs list traversal of course).

At the end of the day you have to accept that like caching and prefetching speedup sequential access, OoO execution[1] will speedup (to a lesser extent) random access. Instead of memory latency, in this case the bottleneck would be the OoO queue depth, or more likely the maximum number of outstanding L1/L2/L3 (and potentially TLB) misses. As long as the maximum number of outstanding misses is lower than the memory latency for that cache level, then, in first approximation, the cpu can effectively hide the sequential vs random access cost for independent accesses.

Benchmarking is hard. Making sure that that a microbenchmark represents your load effectively, doubly so.

[1] Even many in-order CPUs have some run-ahead capabilities for memory.

  • All true. No real disagreement and I'm often saying "it all depends.." myself. :-) In this case, there is also some vagueness around "random" (predictable to what subsystem when).

    I still suspect @kortilla is one of today's lucky 10,000 (https://xkcd.com/1053/) or just read/replied too quickly. :-)

    There is a lot written that indicates that the complexity of modern CPUs is ill-disseminated. But there is also wonderful stuff like https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... { To add a couple links to underwrite my reply in full agreeance. :-) }