Comment by delusional
14 hours ago
> By having a contiguous array of indices to look at, that array can be prefetched as it goes
Does x86-64 actually do this data-dependent, single-dereference prefetch? Because in that case I have some design assumptions I need to reevaluate.
On modern CPUs? Most likely. Those kinds of optimizations are done by the core itself, with no compiler magic needed.
CPU implementation has become too complex to grasp. The only sure way to know how a CPU will behave for a given workload is to run the workload. It's good to have some basic expectations of performance, instructions/cycle, memory bandwidth, to detect if something is off. I guess I'm trying to say it's hard to keep in your head all the details of what ~1B transistors are doing together to run your code. It's just too big.
Hardware definitely supports this, but it might need compiler support, as in emitting explicit prefetch instructions. The compiler might do that automatically, or you might need a pragma or a builtin call. So it can be implemented in any case.
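For the builtin route, here is a minimal sketch, assuming GCC or Clang, where `__builtin_prefetch` is available. The function name `sum_indirect` and the `PREFETCH_DISTANCE` value are made up for illustration; a useful distance depends on the machine and the workload, so treat the number as a placeholder to measure, not a recommendation.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
 * The value 8 is an assumption, not a measured optimum. */
#define PREFETCH_DISTANCE 8

/* Sums data[idx[i]] for i in [0, n).
 * idx is contiguous, so reading idx[i + PREFETCH_DISTANCE] is cheap;
 * the dependent load data[idx[...]] is the one we hint at early. */
long sum_indirect(const long *data, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 (read), locality = 1 (low temporal locality).
             * This is only a hint; the compiler or CPU may ignore it. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

Whether this beats the plain loop is exactly the kind of thing you have to benchmark on the target hardware; the result is the same either way, only the timing can change.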
The compiler probably does that for you [0].
[0] https://gcc.gnu.org/projects/prefetch.html
That list doesn't include any current mainline processors. It's all Itanium, 3DNow!, and MIPS.
Intel added PREFETCHW to their Broadwell processors launched in 2014, years after AMD dropped all 3DNow! instructions except the prefetch instructions. That timeline strongly suggests that the instructions aren't no-ops and likely are used by some popular software.
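As a rough illustration of where a write-intent prefetch could show up in ordinary code, here is a sketch assuming GCC or Clang. Passing `1` as the second argument of `__builtin_prefetch` marks the access as a write; on x86 the compiler may lower that to PREFETCHW when the target enables it (I believe GCC's flag is `-mprfchw`, but check the generated assembly rather than trusting me). The names `count_indirect` and `WRITE_PREFETCH_DISTANCE` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tuning knob; the value is an assumption, not a measurement. */
#define WRITE_PREFETCH_DISTANCE 8

/* Increments counts[idx[i]] for i in [0, n), i.e. a scattered histogram update.
 * Each update is a read-modify-write, so we hint write intent ahead of time. */
void count_indirect(unsigned *counts, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + WRITE_PREFETCH_DISTANCE < n) {
            /* rw = 1 (write), locality = 1. Only a hint; may compile to
             * a plain prefetch or to nothing, depending on the target. */
            __builtin_prefetch(&counts[idx[i + WRITE_PREFETCH_DISTANCE]], 1, 1);
        }
        counts[idx[i]]++;
    }
}
```

The prefetch has no effect on the counts produced, so dropping it is always safe if measurements show no gain.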