Comment by tliltocatl

2 days ago

> pretty useful for compiler to be able to deduplicate/CSE loads/computations

Yes, but is the performance improvement significant enough? L1 latency is a single cycle. Is eliminating that worth the trouble it causes for the application programmer?

L1 latency is 4 cycles typically (1 nanosecond would be closer). And of course it gets longer if you're chasing through multiple pointers.

It of course depends on the specific program, but if you look at any optimization at the level of the individual assembly instructions it affects, everything other than mispredictions, division, and vectorization is "just a couple cycles", so that's not really a meaningful way to look at them.
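
To make the load-deduplication point concrete, here's a rough C sketch. The `config`/`ctx` types and both function names are made up for illustration, and what a given compiler actually emits will vary. In the first loop the store to `out[i]` could in principle overwrite `c->cfg->scale` (both are `int`), so the compiler generally can't keep that value in a register and has to go back through the pointers on every iteration. The second version does the hoisting by hand, which is the transformation the compiler could do itself if it knew the pointers don't alias.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; not from the discussion above. */
struct config { int scale; };
struct ctx { struct config *cfg; };

/* The store to out[i] may alias c->cfg->scale (same type, int), so the
   compiler generally has to reload scale inside the loop. */
void scale_reload(int *out, const int *in, size_t n, const struct ctx *c) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * c->cfg->scale;
}

/* Manual hoist: the pointer chase happens once, before the loop. This is
   only equivalent to scale_reload when out does not overlap the config. */
void scale_hoisted(int *out, const int *in, size_t n, const struct ctx *c) {
    const int scale = c->cfg->scale;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * scale;
}
```

The two functions agree whenever `out` doesn't overlap the config. The saving is one dependent load chain per iteration, and with a constant multiplier hoisted out of the loop the body is also a much easier target for vectorization.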