Comment by Sesse__

1 month ago

Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.

Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!

With the older microarchitectures there was a large penalty for crossing a cache line with AVX-512. In some cases, the performance could be worse than AVX2!

In older microarchitectures like Ice Lake it was pretty bad, so you wanted to avoid unaligned loads if you could. This penalty has rapidly shrunk across subsequent generations of microarchitectures. The penalty is still there but on recent microarchitectures it is small enough that the unaligned case often isn't a showstopper.

The main reason to use aligned loads in code is to denote cases where you expect the address to always be aligned i.e. it should blow up if it isn't. Forcing alignment still makes sense if you want predictable, maximum performance but it isn't strictly necessary for good performance on recent hardware in the way it used to be.

AVX doesn't require alignment of any memory operands, with the exception of the specific load aligned instruction. So you/the compiler are free to use the reg,mem form interchangibly with unaligned data.

The penalty on modern machines is an extra cycle of latency and, when crossing a cacheline, half the throughput (AVX512 always crosses a cacheline since they are cacheline sized!). These are pretty mild penalties given what you gain! So while it's true that peak L1 cache performance is gained when everything is aligned.. the blocker is elsewhere for most real code.

  • > AVX doesn't require alignment of any memory operands, with the exception of the specific load aligned instruction.

    Hah, TIL. Too used to SSE, I guess. (My main target platform is, unfortunately, still limited to SSE3, not even SSSE3.)