Comment by Someone
1 year ago
There’s no guarantee that the fastest AVX2 assembly is the same on all CPUs, and according to https://stackoverflow.com/a/64782733, there are differences between CPUs.
So if you want the fastest code everywhere, chances are you’ll need more than one AVX2 version of it and will have to pick between them per CPU.
I suspect that it is not worth using AVX2 vector gathers on any CPU. But certainly you could end up with the best implementation varying between microarchitectures for other reasons.
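For what it's worth, here is a rough sketch of how picking between several versions at run time could look in C. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` and `__builtin_cpu_is` are available. The names `sum_u32`, `pick_sum`, `sum_avx2_intel` and `sum_avx2_amd` are made up for illustration, and the "AVX2" kernels are plain C placeholders standing in for separately tuned routines. Splitting by vendor is only an example. Real code would probably key on specific microarchitectures after benchmarking.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback, also used as the body of the placeholder kernels. */
static uint64_t sum_generic(const uint32_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

/* Hypothetical tuned variants. In real code each would be its own AVX2
   implementation; here they just forward to the generic loop. */
static uint64_t sum_avx2_intel(const uint32_t *p, size_t n) { return sum_generic(p, n); }
static uint64_t sum_avx2_amd(const uint32_t *p, size_t n)   { return sum_generic(p, n); }

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Choose an implementation based on what the running CPU reports. */
static sum_fn pick_sum(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        if (__builtin_cpu_is("amd")) return sum_avx2_amd;
        return sum_avx2_intel;
    }
#endif
    return sum_generic;
}

/* Public entry point: resolves the implementation on first call.
   Note: this simple caching is not synchronized across threads. */
uint64_t sum_u32(const uint32_t *p, size_t n) {
    static sum_fn impl = NULL;
    if (!impl) impl = pick_sum();
    return impl(p, n);
}
```

Callers only ever see `sum_u32`, so adding or dropping a variant later doesn't touch the rest of the code.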