Comment by andersa
1 day ago
Note this is not true random access in the manner it occurs in most programs. By having a contiguous array of indices to look at, that array can be prefetched as it goes, and speculative execution will take care of loading many upcoming indices of the target array in parallel.
A more interesting example might be if each slot in the target array held, in addition to the value, the next index to go to; that would introduce a dependency chain preventing this from happening.
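Roughly this shape, as a quick C++ sketch (my own names, not the article's code):

```cpp
// Each element carries both a payload and the index of the next element to
// visit, so the address of load i+1 is only known once load i has completed.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    uint32_t next;   // index of the next slot to visit
    uint64_t value;  // payload
};

uint64_t chase(const std::vector<Slot>& slots, size_t steps) {
    uint64_t sum = 0;
    uint32_t idx = 0;
    for (size_t i = 0; i < steps; ++i) {
        sum += slots[idx].value;
        idx = slots[idx].next;  // dependency chain: the misses cannot overlap
    }
    return sum;
}
```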
> A more interesting example might be if each slot in the target array held, in addition to the value, the next index to go to; that would introduce a dependency chain preventing this from happening.
However, on some processors there's a data-dependent prefetcher that will notice the pointer-like value and start prefetching that address before the CPU requests it.
The data-dependent prefetcher is a cool feature, though you do have to be careful with side-channel issues, so on some processors it can be disabled with the Data-Independent Timing bit or similar.
At this point I'm kinda expecting CPU vendors to stop putting as many Spectre mitigations in the main core, and just have a small crypto core with full-fat arithmetic, less hardware for memory access, less speculation, and careful side-channel hardening. You still have to block Meltdown and other large vulnerabilities on the main cores, but if someone wants to protect elliptic curves from weird attacks? Try to set the DIT bit, trap into the OS, and get sent to the hardened core.
Even if the prefetcher were capable of traversing pointers, it wouldn't help. The hypothetical benchmark wouldn't do anything other than chase pointers, and the prefetcher can't really do that any quicker. A traversing prefetcher is useful if the code actually does work for each traversed node; then the prefetcher (or the OoO machinery) can realistically run ahead.
You could probably get around that by storing integers and only converting them to a pointer after reading them (like '0' + 1 is '1').
Do you mean storing the next index/offset and having the pointer value calculated as late as possible by adding the starting address (and maybe multiplying the index by sizeof)? That would probably defeat/mislead Intel's prefetcher, as described at https://www.intel.com/content/www/us/en/developer/articles/t...
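Yes, something along these lines (hypothetical C++ types, not from the linked article): a stored pointer is a value the data-dependent prefetcher can recognize as an address, while a stored index only becomes an address at the point of use.

```cpp
#include <cstdint>

struct PtrNode { PtrNode* next; uint64_t value; };  // 'next' looks like an address in memory
struct IdxNode { uint32_t next; uint64_t value; };  // 'next' is just a small integer

// Pointer chase: the loaded value *is* the next address.
inline const PtrNode* step_ptr(const PtrNode* p) { return p->next; }

// Index chase: the next address is base + next * sizeof(IdxNode), formed as late as possible.
inline uint32_t step_idx(const IdxNode* base, uint32_t i) { return base[i].next; }
```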
Fun fact, that's part of why parsing protobuf is so slow.
Indirection kills performance nowadays. I did a bunch of benchmarking a couple of years back and found that you can parse MessagePack and CBOR faster than Protobuf if you know the types and deserialize directly into them. Even if field order isn't known and you use non-allocated field strings.
Well, in a language that lets you generate compile-time specialized serde code, like Nim or Zig. Maybe C++ has enough compile-time reflection to do it now as well?
I don't know enough about Rust's serde, but it seems like there'd be a lot of performance overhead with its design and the limits of Rust's macro and compiler system.
By the way, DAG-CBOR and dCBOR enforce sorted map keys*, in which case you always know the field order.
*maddeningly, with mutually incompatible sorting rules.
Great point, thanks, and I agree! I thought about also including another experiment for this "linked list"-style access pattern to see what the difference in performance is, but didn't get around to it. Maybe I'll write a follow-up post doing that.
If the next index is stored in the target array and the indexes are random, you will likely get a cycle of length O(sqrt(n)), which can be cached.
You can avoid this with two arrays. One contains random query positions, and the target array is also filled with random values. The next index is then a function of the next query position and the previous value read from the target array.
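A sketch of that two-array idea (illustrative C++, names are mine): each address depends on the previously loaded value, so the loads can't be overlapped, but the walk doesn't collapse into a short cycle.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

uint64_t mixed_walk(const std::vector<uint64_t>& query,   // random query positions
                    const std::vector<uint64_t>& target,  // random values
                    size_t steps) {
    const uint64_t n = target.size();
    uint64_t prev = 0, sum = 0;
    for (size_t i = 0; i < steps; ++i) {
        uint64_t idx = (query[i % query.size()] ^ prev) % n;  // depends on the previous load
        prev = target[idx];
        sum += prev;
    }
    return sum;
}
```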
You could sample from a different random distribution.
E.g., start with every element referencing the next one (i.e. i+1 with wrap-around), and then apply a random shuffle. That way you preserve the full cycle.
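For example (a C++ sketch of that construction): relabel the cycle 0 -> 1 -> ... -> n-1 -> 0 with a random permutation, which keeps it a single cycle of length n while making the visit order unpredictable.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

std::vector<uint32_t> full_cycle(uint32_t n, uint64_t seed) {
    std::vector<uint32_t> p(n);
    std::iota(p.begin(), p.end(), 0u);   // identity permutation
    std::mt19937_64 rng(seed);
    std::shuffle(p.begin(), p.end(), rng);

    std::vector<uint32_t> next(n);
    for (uint32_t i = 0; i < n; ++i)
        next[p[i]] = p[(i + 1) % n];     // successor relation relabeled by p
    return next;  // following next[] from any start visits all n elements
}
```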
You can do that, but it's inefficient with larger arrays. Iterating over 10 billion elements in a random order can take up to an hour. Which is probably more than what you are willing to wait for a single case in a benchmark. On the other hand, you will probably find a cycle in a uniformly random array of 10 billion elements within 0.1 seconds, which is not enough to mitigate the noise in the measurements. So you need a way of generating unpredictable query positions for at least a few seconds without wasting too much time setting it up.
This is why array random access and linked-list random access have wildly different performance characteristics.
Another thing I noticed is that the spike on the left hand side of his graphs is the overhead of file access.
Without this overhead, small array random access should have a lot better per-element cost.
To be clear, the overhead on the left part is only due to file access for the last two graphs (the "direct" summation ones with just one blue line). For all the charts with both blue and yellow lines, there is no file access happening on the left hand side of the graphs, since the file gets read into memory first and then the measurements are run.
> By having a contiguous array of indices to look at, that array can be prefetched as it goes
Does x86-64 actually do this data-dependent single-deref prefetch? Because in that case I have some design assumptions I have to reevaluate.
On modern CPUs? Most likely. Those kinds of optimizations are done by the core, with no compiler magic needed.
CPU implementation has become too complex to grasp. The only sure way to know how a CPU will behave for a given workload is to run the workload. It's good to have some basic expectations of performance, instructions/cycle, memory bandwidth, to detect if something is off. I guess I'm trying to say it's hard to keep in your head all the details of what ~1B transistors are doing together to run your code. It's just too big.
Hardware definitely supports this, but it might need compiler support, as in emitting prefetch instructions, which might happen automatically or require a pragma or a call to a builtin. So it can be implemented in any case.
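For instance, with GCC/Clang the builtin route looks roughly like this (a sketch; the prefetch distance of 8 iterations is an arbitrary pick):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

uint64_t sum_indirect(const std::vector<uint32_t>& idx,
                      const std::vector<uint64_t>& data) {
    uint64_t sum = 0;
    const size_t n = idx.size();
    for (size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]]);  // hint: start fetching a future element
        sum += data[idx[i]];
    }
    return sum;
}
```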
The compiler probably does [0].
[0] https://gcc.gnu.org/projects/prefetch.html
That list doesn't include any current mainline processors. It's all Itanium, 3DNow!, and MIPS.
The article very clearly compares randomized indexes against sequential ones. It's kinda the point of the article.
You seem to misunderstand @andersa's point, which I think is well expressed: it doesn't matter that the indices are randomized if the CPU can prefetch what they will be. The power of CPU speculative execution to hide latency can be quite surprising the first time you see it.
This is a very small Nim program to demonstrate, for the "show me the code" and "it must just not be 'random enough'!" skeptics: https://github.com/c-blake/bu/blob/main/memlat.nim It uses the exact dependency idea @andersa mentions, a random cycle over `x[i] = i`, which others elsewhere in the sub-thread say some CPUs these days are smart enough to "see through". On the Intel CPUs I have, the dependency makes things 12x slower at the gigabyte scale (DIMMs).
EDIT: This same effect makes many a naive hash table microbenchmark { e.g., `for key in keys: lookup(key)` } unrepresentative of performance in real programs where each `key` is often not speculatively pre-computable.
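A toy illustration of that point (C++, not the memlat.nim code; it assumes the table maps keys to other keys so the dependent walk stays inside it):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Keys known up front: the CPU can overlap many lookups speculatively.
uint64_t independent(const std::unordered_map<uint64_t, uint64_t>& table,
                     const std::vector<uint64_t>& keys) {
    uint64_t sum = 0;
    for (uint64_t k : keys) sum += table.at(k);
    return sum;
}

// Each key comes from the previous lookup's result: no overlap possible.
uint64_t dependent(const std::unordered_map<uint64_t, uint64_t>& table,
                   uint64_t key, size_t steps) {
    uint64_t sum = 0;
    for (size_t i = 0; i < steps; ++i) {
        key = table.at(key);  // assumes every value is itself a valid key
        sum += key;
    }
    return sum;
}
```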
In the end it depends on exactly what you want to measure. Of course a load-load dependency will make everything as slow as the latency of the cache level you are accessing, as that becomes the bottleneck.
Traversing a contiguous list of pointers in L1 is also slower than accessing those pointers by generating their addresses sequentially, so adding a load-load dependency is not a good way to benchmark random access vs sequential access (it is a good way to benchmark vector traversal vs list traversal, of course).
At the end of the day you have to accept that, just as caching and prefetching speed up sequential access, OoO execution[1] will speed up (to a lesser extent) random access. Instead of memory latency, in this case the bottleneck would be the OoO queue depth, or more likely the maximum number of outstanding L1/L2/L3 (and potentially TLB) misses. As long as the maximum number of outstanding misses is enough to cover the memory latency for that cache level, then, to a first approximation, the CPU can effectively hide the sequential-vs-random access cost for independent accesses.
Benchmarking is hard. Making sure that a microbenchmark represents your load effectively, doubly so.
[1] Even many in-order CPUs have some run-ahead capabilities for memory.
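To put rough, purely illustrative numbers on the outstanding-misses point above (assumed values, not measurements), Little's law gives throughput = misses in flight / latency:

```cpp
#include <cstdio>

int main() {
    const double latency_ns = 100.0;  // assumed DRAM latency
    const double in_flight  = 10.0;   // assumed max outstanding misses per core
    std::printf("dependent chain : %.0f ns per access\n", latency_ns);
    std::printf("independent     : %.0f ns per access\n", latency_ns / in_flight);
}
```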