Comment by Someone

1 year ago

There’s no guarantee that the fastest AVX2 assembly is the same on all CPUs, and according to https://stackoverflow.com/a/64782733, there are real differences between CPUs.

So, chances are you’ll need more than one AVX2 assembly version of your code if you want the fastest code on every CPU.
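One way to live with several versions is to pick between them at startup by timing each one on the machine the program is actually running on. The sketch below is only an illustration of that idea in portable C: `sum_variant_a` and `sum_variant_b` are hypothetical stand-ins for two differently tuned AVX2 kernels (they are plain scalar loops here), and `pick_fastest` and `time_variant` are names made up for this example. Timing with `clock()` is coarse, so a real selector would probably want a better timer and more repetitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Common signature shared by every kernel variant. */
typedef uint64_t (*sum_fn)(const uint32_t *data, size_t n);

/* Hypothetical variant A: stands in for one tuned AVX2 kernel. */
static uint64_t sum_variant_a(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Hypothetical variant B: same result, different loop structure
 * (four independent accumulators), standing in for a second kernel. */
static uint64_t sum_variant_b(const uint32_t *data, size_t n) {
    uint64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        t0 += data[i];
    return t0 + t1 + t2 + t3;
}

/* Run one variant `reps` times and return the elapsed CPU time in seconds. */
static double time_variant(sum_fn fn, const uint32_t *data, size_t n, int reps) {
    volatile uint64_t sink = 0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += fn(data, n);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time every candidate on a sample input and return the fastest one.
 * Also checks that all candidates agree on the sample before trusting them. */
static sum_fn pick_fastest(const uint32_t *sample, size_t n) {
    sum_fn candidates[] = { sum_variant_a, sum_variant_b };
    size_t count = sizeof candidates / sizeof candidates[0];
    uint64_t expected = candidates[0](sample, n);

    sum_fn best = candidates[0];
    double best_time = time_variant(best, sample, n, 200);
    for (size_t i = 1; i < count; i++) {
        assert(candidates[i](sample, n) == expected);
        double t = time_variant(candidates[i], sample, n, 200);
        if (t < best_time) {
            best_time = t;
            best = candidates[i];
        }
    }
    return best;
}
```

The program would call `pick_fastest` once, keep the returned function pointer, and route every later call through it. Selecting by CPU model instead of by measurement is the other common approach; it avoids the startup cost but needs a table of which version wins on which microarchitecture.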

I suspect that AVX2 vector gathers are not worth using on any CPU. But the best implementation could certainly still vary between microarchitectures for other reasons.