Comment by atq2119
6 months ago
Yep, same with overlapping unaligned loads. It's just fairly cheap to make that stuff pipelined and run fast. It's only when you mix loads and stores in the same memory region that there are conflicts that can slow you down (and then quite horribly actually, depending on the exact processor).
The place where I see this really hurt is when Clang/LLVM gets too fancy, in situations like this:
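The original snippet is missing from this copy, so here is a stand-in sketch of the kind of pattern meant, not necessarily the exact code the comment had in mind. It assumes a hypothetical 8-byte struct `Pair` whose two fields are written with separate 4-byte stores and then copied as a whole. The names `Pair`, `set_fields` and `copy_pair` are made up for illustration, and whether a given compiler version really emits two narrow stores followed by one wide load depends on flags and inlining.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte struct made of two 4-byte fields. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} Pair;

/* Writes the fields one at a time. Kept out of line (GCC/Clang attribute)
 * so the stores really go to memory; a compiler will usually emit two
 * separate 4-byte stores here, though it is free to merge them. */
__attribute__((noinline))
void set_fields(Pair *p, uint32_t lo, uint32_t hi)
{
    assert(p != NULL);
    p->lo = lo;
    p->hi = hi;
}

/* Copies the whole struct. An optimizing compiler will typically do this
 * with a single 8-byte load. If the two 4-byte stores above are still
 * sitting in the store buffer, that one load needs bytes from two
 * different stores, which is the case store-to-load forwarding
 * generally cannot serve. */
__attribute__((noinline))
Pair copy_pair(const Pair *p)
{
    assert(p != NULL);
    return *p;
}
```

Calling `set_fields(&p, ...)` and then `copy_pair(&p)` right away gives the narrow-stores-then-wide-load sequence. The result is always correct, only the timing suffers.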
Boom, store-to-load forwarding failure, and a bad stall. E.g., the Zen series seems to be really bad at this (only tried up to Zen 3), but there are pretty much no out-of-order CPUs that handle this without some kind of penalty.
This happens with partial autovectorization, too. The compiler fails to vectorize a first loop and then vectorizes the second; the result is a store-forwarding failure at the start of the second loop, when it tries to read the output of the first loop, and that erases the vectorization gains.
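A minimal sketch of that partial-vectorization case, assuming a hypothetical function `prefix_then_scale`. The first loop has a loop-carried dependency, so compilers usually leave it scalar. The second is a plain elementwise loop that they usually vectorize. Whether that actually happens, and how big the stall is, depends on the compiler, the flags and the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: running sum in place, then scale every element. */
void prefix_then_scale(float *buf, size_t n, float k)
{
    assert(buf != NULL || n == 0);

    /* Loop 1: each iteration depends on the previous one, so this is
     * normally compiled as scalar code with one 4-byte store per element. */
    for (size_t i = 1; i < n; i++) {
        buf[i] += buf[i - 1];
    }

    /* Loop 2: independent iterations, normally autovectorized. Each vector
     * load (16 or 32 bytes, say) covers several of the 4-byte stores from
     * loop 1. For short arrays those stores may still be in the store
     * buffer, so the wide loads cannot be forwarded from them. */
    for (size_t i = 0; i < n; i++) {
        buf[i] *= k;
    }
}
```

For example, `{1, 2, 3, 4}` with `k = 2` becomes `{2, 6, 12, 20}`. The effect matters most when `n` is small enough that loop 2 starts before loop 1's stores have drained to the cache.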