Comment by Remnant44
1 month ago
AVX doesn't require alignment of any memory operands, with the exception of the specific load aligned instruction. So you/the compiler are free to use the reg,mem form interchangeably with unaligned data.
The penalty on modern machines is an extra cycle of latency and, when crossing a cacheline, half the throughput (an unaligned AVX512 access always crosses a cacheline, since the vectors are cacheline sized!). These are pretty mild penalties given what you gain! So while it's true that peak L1 cache performance is gained when everything is aligned, the blocker is elsewhere for most real code.
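To make that concrete, here is a rough sketch in C intrinsics, assuming GCC or Clang on x86-64 with AVX available at runtime. The function name `sum_unaligned` is made up for illustration. `_mm256_loadu_ps` has no alignment requirement, and the compiler is free to fold it into the memory operand of the add; the aligned counterpart `_mm256_load_ps` is the one that faults on a misaligned address.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sum n floats starting at p.
   p may have any alignment; nothing here assumes 32-byte alignment. */
__attribute__((target("avx")))
float sum_unaligned(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration via an unaligned load. */
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    }

    /* Spill the 8 partial sums and add them up in scalar code. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; k++) {
        total += lanes[k];
    }

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++) {
        total += p[i];
    }
    return total;
}
```

Calling it as `sum_unaligned(buf + 1, n)` on a 32-byte-aligned `buf` exercises the misaligned path. Whether the cost shows up at all depends on the surrounding code, as noted above.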
> AVX doesn't require alignment of any memory operands, with the exception of the specific load aligned instruction.
Hah, TIL. Too used to SSE, I guess. (My main target platform is, unfortunately, still limited to SSE3, not even SSSE3.)