Comment by adrian_b

5 hours ago

Most 512-bit instructions are not split into two 256-bit instructions, either on Zen 4 or on laptop Zen 5. This is a myth caused by a very poor choice of words by the AMD CEO at the initial Zen 4 presentation.

For most 512-bit instructions that operate on the vector registers, Zen 4 and all the Intel CPUs supporting AVX-512 have identical throughput: two 512-bit instructions per clock cycle.
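
If you want to verify such throughput numbers yourself, the usual approach is to time a loop of independent 512-bit instructions spread over several accumulator chains, roughly like the sketch below (purely illustrative; the instruction choice and iteration count are arbitrary, and __rdtsc() counts reference cycles rather than core cycles, so the printed rate has to be scaled by the core/TSC clock ratio before comparing it with the two-per-cycle figure):

```c
/* Rough throughput micro-benchmark sketch (illustrative only).
 * Four independent 512-bit integer-add chains, so the loop is limited by
 * instruction throughput, not latency. Build with e.g.:
 *   gcc -O2 -mavx512f bench.c
 * Note: __rdtsc() counts reference cycles, not core cycles, so the printed
 * rate must be scaled by the core/TSC clock ratio. */
#include <immintrin.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const long iters = 100 * 1000 * 1000;
    __m512i one = _mm512_set1_epi32(1);
    __m512i acc0 = one, acc1 = one, acc2 = one, acc3 = one;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        acc0 = _mm512_add_epi32(acc0, one);
        acc1 = _mm512_add_epi32(acc1, one);
        acc2 = _mm512_add_epi32(acc2, one);
        acc3 = _mm512_add_epi32(acc3, one);
    }
    uint64_t ticks = __rdtsc() - t0;
    /* Use the results so the compiler cannot delete the loop. */
    int sink = _mm512_reduce_add_epi32(_mm512_add_epi32(
        _mm512_add_epi32(acc0, acc1), _mm512_add_epi32(acc2, acc3)));
    printf("checksum %d, ~%.2f 512-bit adds per reference cycle\n",
           sink, 4.0 * (double)iters / (double)ticks);
    return 0;
}
```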

There are only a few instructions where Zen 4 is inferior to the most expensive of the Intel server/workstation CPUs, but those are important instructions for some applications.

The Intel CPUs have double the throughput for transfers to and from the L1 cache. Zen 4 can do only one 512-bit load per cycle plus one 512-bit store every other cycle, while the Intel CPUs with AVX-512 support and Zen 5 (server/desktop/Halo) can do two 512-bit loads plus one 512-bit store per cycle.
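
To see where this matters, consider a simple array addition kept in L1: each iteration needs two 512-bit loads and one 512-bit store, so a core with two loads plus one store per cycle can retire one iteration per cycle, while one load per cycle plus a store every other cycle needs about two cycles per iteration. A minimal illustrative sketch (assuming AVX-512F, data resident in L1, and n a multiple of 8):

```c
/* Illustrative L1-bandwidth-bound kernel: each iteration issues two
 * 512-bit loads and one 512-bit store. Assumes AVX-512F (-mavx512f),
 * arrays resident in L1, and n a multiple of 8. */
#include <immintrin.h>
#include <stddef.h>

void add_f64(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);            /* 512-bit load  */
        __m512d vb = _mm512_loadu_pd(b + i);            /* 512-bit load  */
        _mm512_storeu_pd(c + i, _mm512_add_pd(va, vb)); /* 512-bit store */
    }
}
```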

The other difference is that the most expensive Intel CPUs (typically Gold/Platinum Xeons) have a second 512-bit floating-point multiplier, which is missing on Zen 4 and on the cheaper Intel SKUs. Thus Zen 4 can do one fused multiply-add (or FMUL) plus one FP addition per cycle, while the most expensive Intel CPUs can do two FMA or FMUL per cycle. This results in double the performance for the most expensive Intel CPUs vs. Zen 4 in many linear algebra benchmarks, e.g. Linpack or DGEMM. However, there are many other applications of AVX-512 besides linear algebra, where Zen 4 can be faster than most or even all Intel CPUs.
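
Back-of-the-envelope: a 512-bit FMA is 8 doubles x 2 flops = 16 flops, so two FMA pipes give 32 double-precision flops per cycle, while one FMA pipe plus one FP adder gives at most 16 + 8 = 24, and only 16 in FMA-dominated code such as the inner loops of DGEMM; that is where the roughly 2x gap in Linpack comes from.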

On the other hand, server/desktop/Halo Zen 5 has double the throughput for most 512-bit instructions in comparison with any Intel CPU. Presumably the future Intel Diamond Rapids server CPU will match the throughput of Zen 5 and Zen 6, i.e. four 512-bit instructions per clock cycle.

On Zen 4, using AVX-512 provides very significant speedups over AVX2 in most cases, despite the fact that the same execution resources are used. This proves that there still are cases where the ISA matters a lot.
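
One concrete example of the kind of ISA-level feature that can help even at the same execution width (just an illustration, not the only reason): AVX-512 per-lane masking lets the remainder of a loop be handled with the same vector instructions, where AVX2 code typically falls back to a scalar tail loop. A minimal sketch, assuming AVX-512F and the hypothetical function below:

```c
/* Sketch: the masked tail keeps the leftover 1..15 elements in vector
 * code instead of a scalar remainder loop. Illustrative only; assumes
 * AVX-512F (-mavx512f). */
#include <immintrin.h>
#include <stddef.h>

void scale_f32(float *x, float s, size_t n) {
    __m512 vs = _mm512_set1_ps(s);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(x + i, _mm512_mul_ps(_mm512_loadu_ps(x + i), vs));
    if (i < n) {                         /* masked tail, no scalar loop */
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1u);
        __m512 v = _mm512_maskz_loadu_ps(m, x + i);
        _mm512_mask_storeu_ps(x + i, m, _mm512_mul_ps(v, vs));
    }
}
```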