Comment by VorpalWay

19 days ago

> However the big difference (in most languages) is that functions can take arbitrarily long. Array access either succeeds or fails quickly.

For some definition of quick. Modern CPUs are usually bottlenecked by memory bandwidth and cache size, so a function that recomputes the value can often be quicker than a lookup table, at least outside of microbenchmarks (in a microbenchmark you don't have to compete with the rest of the code base for cache space).

Of course this depends heavily on how expensive the function is, but it is worth keeping in mind that memory is not necessarily quicker than computing again. If you need to go to main memory, you have something on the order of a hundred ns that you could spend recomputing the value instead, which at 2 GHz translates to about 200 clock cycles. What that means in terms of number of instructions depends on the instruction mix, how many execution units you can saturate, whether the CPU can prefetch the memory, how well it predicts branches, etc. But RAM is really slow. Even with a cache hit you are talking single-digit ns to tens of ns (depending on whether it is L1, L2 or L3).
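To make that concrete, here is a minimal Rust sketch of the comparison (my own illustration, not from the thread): `recompute` is a made-up stand-in for a cheap function, the table is sized well past any L3 so lookups mostly miss, and the indices are scattered so the prefetcher can't hide the misses. Actual numbers will vary wildly by machine.

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical cheap function standing in for "recompute the value":
// a few integer multiplies and shifts (an integer mixing function).
fn recompute(i: u32) -> u32 {
    let x = i.wrapping_mul(2654435761); // Knuth's multiplicative hash constant
    (x ^ (x >> 16)).wrapping_mul(0x45d9f3b)
}

fn main() {
    const N: usize = 1 << 24; // 16M u32 entries = 64 MiB, well past any L3

    // Precompute the lookup table.
    let table: Vec<u32> = (0..N as u32).map(recompute).collect();

    // Scatter the accesses with a xorshift PRNG so the hardware
    // prefetcher cannot hide the cache misses.
    let mut state = 0x12345678u32;
    let indices: Vec<usize> = (0..N)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            state as usize & (N - 1)
        })
        .collect();

    // Variant 1: table lookup. At this working-set size most of these
    // loads go all the way to main memory.
    let start = Instant::now();
    let mut sum = 0u32;
    for &i in &indices {
        sum = sum.wrapping_add(black_box(table[i]));
    }
    println!("lookup:    {:?} (sum={sum})", start.elapsed());

    // Variant 2: recompute. No memory traffic beyond the index stream,
    // which is read sequentially and prefetches well.
    let start = Instant::now();
    let mut sum = 0u32;
    for &i in &indices {
        sum = sum.wrapping_add(black_box(recompute(i as u32)));
    }
    println!("recompute: {:?} (sum={sum})", start.elapsed());
}
```

On a typical desktop the recompute loop can win by a large factor at this size; shrink `N` until the table fits in L1 and the lookup tends to win instead, which is exactly the cache-pressure point above.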

I've been watching those Kaze Emanuar videos on his N64 development, and it's always so weird to me when "doing the expensive computation again" is cheaper than "using the precomputed value". I'm not disputing it; he seems to have done a lot of research and testing confirming the results, and I have no reason to think he's lying, but it's so utterly counter-intuitive to me.

  • I haven't looked into the N64, but CPU speed has been growing faster than RAM speed for decades. I'm not sure exactly when that started; probably some time in the late 80s or early 90s, since that is about when PCs started getting cache memory, I believe.

    • I wonder if a turning point was out-of-order execution. Many computations use some values from memory plus others that could be computed, and out-of-order execution lets the latter proceed while waiting on memory for the former. That would improve utilization and be a 'win' even if the recomputation in isolation were no faster than the memory load.

    • The N64 was just really weirdly designed: they went with an overpowered CPU for bragging rights, and bet on the wrong RAM horse, Rambus.

  • I used to develop for the N64 and I can confirm that it is true. It is crazy how much faster the CPU is compared with RAM access that misses the cache.

    Optimizing for RAM access patterns instead of CPU instruction counts can make your code orders of magnitude faster.

  • Doing maths is extremely fast. You need a lot of maths to add up to the cost of a single memory access that misses both L1 and L2 (the sine sketch after this thread illustrates the trade-off).

    • And you need to burn even more cycles before you've amortized the cost of occupying a cache line that could have benefited some other work.
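A concrete instance of this trade-off, in the spirit of the N64 videos mentioned above (my illustration of the general idea, not Kaze's actual code): replace a sine lookup table with a short polynomial. The truncated Taylor series below is a handful of arithmetic ops and touches no memory, while the table costs one load that is only cheap when the table happens to be hot in cache.

```rust
// Table-based sine for [-pi/2, pi/2]: 1024 precomputed entries.
// One load per call; cheap in L1, expensive after eviction.
fn build_table() -> Vec<f32> {
    (0..1024)
        .map(|i| {
            let x = -std::f32::consts::FRAC_PI_2
                + (i as f32 / 1023.0) * std::f32::consts::PI;
            x.sin()
        })
        .collect()
}

fn sin_lut(table: &[f32], x: f32) -> f32 {
    // Map [-pi/2, pi/2] onto indices 0..=1023; assumes x is in range.
    let t = (x + std::f32::consts::FRAC_PI_2) / std::f32::consts::PI;
    table[((t * 1023.0) as usize).min(1023)]
}

// Recomputed sine: truncated Taylor series, reasonable on [-pi/2, pi/2]
// (max error around 0.45% at the interval ends). Arithmetic only, no loads.
fn sin_poly(x: f32) -> f32 {
    let x2 = x * x;
    x * (1.0 - x2 / 6.0 + (x2 * x2) / 120.0)
}

fn main() {
    let table = build_table();
    for &x in &[-1.2f32, -0.3, 0.0, 0.7, 1.5] {
        println!(
            "x={:+.2}  poly={:+.5}  lut={:+.5}  libm={:+.5}",
            x,
            sin_poly(x),
            sin_lut(&table, x),
            x.sin()
        );
    }
}
```

Which version wins depends entirely on whether the table stays resident: in a tight loop that hammers the table it will sit in L1 and the lookup is fast, but called occasionally from a large code base it pays the cache-miss latency discussed above, and the polynomial pulls ahead.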

> For some definition of quick. Modern CPUs are usually bottlenecked by memory bandwidth and cache size.

I meant that in most languages, functions aren't guaranteed to return in finite time at all.
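A small Rust illustration of that distinction (my framing, not the commenter's): an array access is bounded in the sense that it returns or fails right away, while nothing about a function's signature rules out running forever.

```rust
// An array access either yields a value or fails immediately:
// `a[i]` panics on out-of-bounds, `a.get(i)` returns None.
fn lookup(a: &[u32], i: usize) -> Option<u32> {
    a.get(i).copied()
}

// A function call carries no such guarantee. Nothing in this signature
// says it terminates; whether this loop halts for every positive input
// is the Collatz conjecture, an open problem. (Overflow is ignored here;
// in a debug build `3 * x + 1` would panic for very large x.)
fn collatz_steps(mut x: u64) -> u64 {
    let mut steps = 0;
    while x != 1 {
        x = if x % 2 == 0 { x / 2 } else { 3 * x + 1 };
        steps += 1;
    }
    steps
}

fn main() {
    println!("{:?}", lookup(&[10, 20, 30], 1)); // Some(20), immediately
    println!("{}", collatz_steps(27)); // 111 steps, but no a-priori bound
}
```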