Comment by crote

3 days ago

What is the "nasty surprise" of Zen 4 AVX512? Sure, it's not quite the twice as fast you might initially assume, but (unlike Intel's downclocking) it's still a strict upgrade over AVX2, is it not?

It splits each 512-bit instruction into two 256-bit instructions internally. That's the main nasty surprise.

I suppose it saves a little on decoding, but it's ultimately no more effective than just issuing the two 256-bit instructions yourself.

  • Single pumped AVX512 can still be a lot more effective than double pumped AVX2.

    AVX512 has 2048 bytes of named registers (32 registers × 64 bytes); AVX2 has 512 bytes (16 registers × 32 bytes). AVX512 uses out-of-band registers for masking; AVX2 keeps its masks in-band, in ordinary vector registers. AVX512 has better options for swizzling values around. All (almost all?) AVX512 instructions have masked variants, allowing you to combine an operation and a subsequent mask operation into a single instruction.

    Oftentimes I'll write the AVX512 version first, then go to write the AVX2 version, and a lot of the special sauce that made the AVX512 version good doesn't work in AVX2, so it's real awkward to get the same thing done.

  • The benefit seems to be that we are one step closer to not needing the fallback path. This was probably a lot more relevant before Intel shit the bed with consumer AVX-512 by shipping E-cores that lack the feature.

  • Predicated instructions are incredibly useful (and AVX-512 only). They let you get rid of the usual tail handling at the end of the loop.

  • AVX-512 on Zen 4 also includes a bunch of instructions that weren't in the 256-bit set, including enhanced masking, 16-bit floats, and bit-manipulation instructions, plus a register file with twice as many registers, each twice as wide.
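To make the tail-handling point concrete, here is a minimal sketch, not anyone's production code. `add_f32` is a hypothetical helper name; the intrinsics are the standard AVX-512F ones from `<immintrin.h>`. It assumes a compiler that defines `__AVX512F__` when AVX-512 is enabled (e.g. via `-mavx512f`), and falls back to a plain scalar loop otherwise so it builds anywhere.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
#if defined(__AVX512F__)
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        /* Full blocks enable all 16 lanes; the final partial block
           enables only the low `left` lanes. */
        __mmask16 k = left >= 16 ? (__mmask16)0xFFFF
                                 : (__mmask16)((1u << left) - 1u);
        /* Masked-off lanes are neither read nor written, so the loop
           never touches memory past the end of the arrays. */
        __m512 va = _mm512_maskz_loadu_ps(k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        _mm512_mask_storeu_ps(dst + i, k, _mm512_add_ps(va, vb));
    }
#else
    /* Scalar fallback when AVX-512F is not enabled at compile time. */
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

The same loop body covers both the full blocks and the leftover elements, so there is no separate scalar cleanup loop after the vector loop, which is what an AVX2 version would usually need.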

> it's still a strict upgrade over AVX2

If you benchmark it, it will be slower about half the time.

  • For the simplest cases it will be about the same speed as AVX2, but if you're trying to do anything fancy, the extra registers and instructions are a godsend.
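As an illustration of the "combine an operation and a subsequent mask operation" point made earlier in the thread, here is a rough sketch under the same assumptions as any AVX-512 code: `add_where_negative` is a made-up helper, the intrinsics are standard AVX-512F, and the code drops to a scalar loop when `__AVX512F__` is not defined. In AVX2 you would typically do the add unconditionally and then follow it with a separate blend such as `_mm256_blendv_ps`.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: x[i] += y[i] only where x[i] < 0;
   all other elements are left untouched. */
void add_where_negative(float *x, const float *y, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        /* The comparison writes a 16-bit mask register, not a vector. */
        __mmask16 k = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(), _CMP_LT_OQ);
        /* Merge masking: lanes where k is 0 keep the value of the first
           argument (vx), so the add and the select are one instruction. */
        vx = _mm512_mask_add_ps(vx, k, vx, vy);
        _mm512_storeu_ps(x + i, vx);
    }
#endif
    /* Scalar loop handles the remainder, or everything without AVX-512F. */
    for (; i < n; i++)
        if (x[i] < 0.0f)
            x[i] += y[i];
}
```

Whether this actually beats the AVX2 version on a given chip is something you'd have to benchmark; the sketch only shows why the AVX-512 form is less awkward to write.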