How much slower is random access, really?

4 days ago (samestep.com)

51 comments

sestep

Note this is not true random access in the manner it occurs in most programs. By having a contiguous array of indices to look at, that array can be prefetched as it goes, and speculative execution will take care of loading many upcoming indices of the target array in parallel.

A more interesting example might be if each slot in the target array has the next index to go to in addition to the value, then you will introduce a dependency chain preventing this from happening.

wtallis 12 hours ago
> A more interesting example might be if each slot in the target array has the next index to go to in addition to the value, then you will introduce a dependency chain preventing this from happening.
However, on some processors there's a data-dependent prefetcher that will notice the pointer-like value and start prefetching that address before the CPU requests it.
- less_less 6 hours ago
  
  The data-dependent prefetcher is a cool feature, though you do have to be careful with side-channel issues, so some of them can disable it with the Data-Independent Timing bit or similar.
  At this point I'm kinda expecting CPU vendors to stop putting as many Spectre mitigations in the main core, and just have a small crypto core with full-fat arithmetic, less hardware for memory access, less speculation, and careful side-channel hardening. You still have to block Meltdown and other large vulnerabilities on the main cores, but if someone wants to protect elliptic curves from weird attacks? Try to set the DIT bit, trap into the OS, and get sent to the hardened core.
- gpderetta 5 hours ago
  
  Even if the prefetcher was capable of traversing pointers, it wouldn't help. The hypothetical benchmark wouldn't do anything other chasing pointers, and the prefetcher can't really do that any quicker. A traversing prefetcher is useful if the code actually does work for each traversed node, then the prefetcher (or the OoO machinery) could realistically run ahead.
- deepsun 8 hours ago
  
  Could probably overcome that by using integers, but converting them to a pointer after accessing (like '0'+1 is '1').
  
  1 reply →
hansvm 12 hours ago
Fun fact, that's part of why parsing protobuf is so slow.
- elcritch 2 hours ago
  
  Indirection kills performance nowadays. I did a bunch of benchmarking a couple years back and found that you can parse MessagePack and CBOR faster than Protobuf if you know the types and serialize directly into them. Even if field order isn't known and you use non-allocated field strings.
  Well in a language that allows you to generate compile time specialized serde code like Nim or Zig. Maybe C++ has enough compile time reflection to do it now as well?
  I don't know enough about Rust's serde, but it seems like there'd be a lot of performance overhead with it's design and the limits of Rust's macro and compiler system.
  
  1 reply →
jltsiren 9 hours ago
If the next index is stored in the target array and the indexes are random, you will likely get a cycle of length O(sqrt(n)), which can be cached.
You can avoid this with two arrays. One contains random query positions, and the target array is also filled with random values. The next index is then a function of the next query position and the previous value read from the target array.
- eru 7 hours ago
  
  You could sample from a different random distribution.
  Eg start with every element referencing the next element (i.e. i+1 with wrap-around), and then use a random shuffle. That way, you preserve the full cycle.
  
  2 replies →
jiggawatts 15 hours ago

This is why array random access and linked-list random access have wildly different performance characteristics.
Another thing I noticed is that the spike on the left hand side of his graphs is the overhead of file access.
Without this overhead, small array random access should have a lot better per-element cost.
kortilla 7 hours ago
The article very clearly compares using randomized indexes and sequential. It’s kinda the point of the article.
- cb321 7 hours ago
  
  You seem to misunderstand @andersa's point which I think is well expressed - it doesn't matter if the indices are randomized if the CPU can pre-fetch what they will be. The power of CPU speculative execution to hide latency can be quite surprising the first time you see it.
  This is a very small Nim program to demonstrate for "show me the code" and "it must just not be 'random enough'!" skeptics: https://github.com/c-blake/bu/blob/main/memlat.nim It uses the exact dependency idea @andersa mentions of a random cycle of `x[i] = i` that others else-sub-thread say some CPUs these days are smart enough to "see through". On Intel CPUs I have, the dependency makes things 12x slower at the gigabyte scale (DIMMs).
  EDIT: This same effect makes many a naive hash table microbenchmark { e.g., `for key in keys: lookup(key)` } unrepresentative of performance in real programs where each `key` is often not speculatively pre-computable.
  
  2 replies →
delusional 9 hours ago
> By having a contiguous array of indices to look at, that array can be prefetched as it goes
Does x86 64 actually do this data dependent single deref prefetech? Because in that case I have a some design assumptions I have to reevaluate.
- phi-go 8 hours ago
  
  Hardware definitely supports this but it might need compiler support, as in adding instructions to do prefetching. Which might be done automatically or requires a pragma or calling a builtin. So it can be implemented in any case.
- shakna 7 hours ago
  
  The compiler probably does [0].
  [0] https://gcc.gnu.org/projects/prefetch.html
  
  2 replies →

porcoda 15 hours ago

The RandomAccess (or GUPS) benchmark (see: https://ieeexplore.ieee.org/document/4100365) was looking at measuring machines on this kind of workload. In high performance computing this was important for graph calculations and was one of the things the Cray (formerly Tera) MTA machine was particularly good at. I suppose this benchmark wouldn’t be very widely known outside HPC circles.

jandrewrogers 12 hours ago
I worked on the MTA architectures for years among several other HPC systems but I don’t remember this particular benchmark. I suspect it was replaced by the Graph500 benchmark. Graph500 measures something similar and was introduced only a few years after GUPS.
- porcoda 11 hours ago
  
  The HPCS benchmarks predated Graph500. They were talked about at SC for a few years in the early 2000s but mostly faded into the background. It’s hard to dig up the numbers for the MTA on RandomAccess, but the Eldorado paper from ‘05 by Feo and friends (https://dl.acm.org/doi/10.1145/1062261.1062268) mentions it and you can see the MTA beating the other popular architectures of the time in one of the tables.
  
  1 reply →

archi42 4 hours ago

What surprises me is the 24 GB of DDR4 DRAM on a dual channel memory controller? AFAIK there are only 8 GB or 16 GB modules, no 12 GB modules. At least I can only find 12 GB DDR5 modules listed, but not DDR4.

This means: The system likely uses 3x 8 GB modules. As a result, one channel has two modules with 16 GB total, while the other channel has only a single 8 GB module.

Not sure how big this impact is with the given memory access patterns and assuming [mostly] exclusive single-threaded access. It's just something I noted, and could be a source of unexpected artifacts.

CodesInChaos 1 hour ago

Could be 2x8 + 2x4. Mine has 2x32 + 2x8, since I upgraded from 16 to 80 instead of 64.

JonChesterfield 9 hours ago

Random access is catastrophically slower because of the successive cache misses when the prefetcher fails to guess what you're doing.

One hint in the same article that random access is not cheap, in contrast with the conclusion, was noticing that the shuffle was unacceptably slow on large data sets.

Still, good to see peformance measurements, especially where the curves look roughly like you'd hope them to.

Animats 9 hours ago

If, of course, you have the CPU and its caches all to yourself.

anonymousDan 5 hours ago

Would a better benchmark not just use some kind of pseudo randomly generated sequence to avoid having two arrays?

Adhyyan1252 16 hours ago

Love this analysis! Was expecting random to be much slower. 4x is not bad at all

Nevermark 10 hours ago

There has to be some power hit for all those extra cache fills. No idea if it would be measurable.

b0a04gl 8 hours ago

longtime back ran a model scoring job where each row did feature lookups across diff memory regions. access pattern changed every run since it depended on prev feature values. avg latency stayed fine but total run time kept jumping sometimes 6s, sometimes 8s on same input. perf counters looked flat but thread scheduling kept shifting. reordered the lookups to cut down mem jumps. after that threads aligned better, scheduler ran smoother, run time got stable across batches. gain wasn’t from faster mem, imo i observed it's not the memory is slower but rather how less predictable it's

o11c 10 hours ago

Hm, no discussion of cache line size, page size, or the limits of cache associativity?

FpUser 12 hours ago

I did another type of experiment which evaluates benefits of branch prediction on AMD 9950X on contiguous array with 1,000,000 elements. Calculated sum adding element if it is bigger than 125 (50% of 256). Difference between random and sorted was 10 times. I guess branch prediction plays a huge role as well.

Andys 12 hours ago
Thanks for sharing that.
Presumably if you'd split the elements into 16 shares (one for each CPU), summed with 16 threads, and then summed the lot at the end, then random would be faster than sorted?
- bee_rider 9 hours ago
  
  I don’t think random should be faster than contiguous access, if you parallelize both of them.
  Although, it looks like that chip has a 1MB L2 cache for each core. If these are 4 Bytes ints, then I guess they won’t all fit in one core’s L2, but maybe they can all start out in their respective cores’ L2 if it is parallelized (well, depends on how you set it up).
  Maybe it will be closer. Contiguous should still win.

Surac 9 hours ago

is this not just a memory test for the burst capacitiy and access strategy of the dram controller?

forrestthewoods 15 hours ago

Here’s an older blog post of mine on roughly the same topic:

https://www.forrestthewoods.com/blog/memory-bandwidth-napkin...

I’m not sure I agree with the data presentation format. “time per element” doesn’t seem like the right metric.

klank 13 hours ago
What are your qualms with time per element? I liked it as a metric because it kept the total deviation of results to less than 32 across the entire result set.
Using something like the overall run length would have such large variations making only the shape of the graph particularly useful (to me) less so much the values themselves.
If I was showing a chart like this to "leadership" I'd show with the overall run length. As I'd care more about them realizing the "real world" impact rather than the per unit impact. But this is written for engineers, so I'd expect it to also be focused on per unit impacts for a blog like this.
However, having said all that, I'd love to hear what your reservations are using it as a metric.
- forrestthewoods 9 hours ago
  
  It’s not wrong per se. I’m just very wary of nano-scale benchmarks. And I think in general you should advertise “velocity” not “time per”.
  Perhaps it’s a long time inspiration from this post: https://randomascii.wordpress.com/2018/02/04/what-we-talk-ab...
  I also just don’t know what to do with “1 ns per element”. The scale of 1 to 4 ns per element is remarkably imprecise. Discussing 1 to 250 million to 1 billion elements per second feels like a much wider range. Even if it’s mathematically identical.
  Your graphs have a few odd spikes that weren’t deeply discussed. If it’s under 2ns per element who cares!
  The logarithmic scale also made it really hard to interpret. Should have drawn clearer lines at L1/L2/L3/ram limits.
  On skim I don’t think there’s anything wrong. But as presented it’s a little hard for me as an engineer to extract lessons or use this information for good (or evil).
  There shouldn’t be a Linux vs Mac issue. Ignoring mmap this should be HW.
  I dunno. Those are all just surface level reactions.
alain94040 13 hours ago
From your blog post:
> Random access from the cache is remarkably quick. It's comparable to sequential RAM performance
That's actually expected once you think about it, it's a natural consequence of prefetching.
- delusional 9 hours ago
  
  If that wasn't the case the machine would have to prefetch to register file. I don't know of any CPU that does that.
- forrestthewoods 7 hours ago
  
  Heh. That line often gets called out.
  Lots of things are expected when you deeply understand a complex system and think about it. But, like, not everyone knows the system that deeply nor have they thought about it!
petermcneeley 11 hours ago

Whats most misleading is the data for the smaller sizes (1k)