Comment by adgjlsfhk1
1 year ago
The counterpoint to this is that if you can write AVX2 assembly, that will be supported on ~99% of x86 CPUs around today (Haswell was 2013), so just that one branch covers ~80% of the desktop/laptop market.
AVX2 support is at 94.67% according to the Steam hardware survey, which is probably close enough.
https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...
There’s no guarantee that the fastest AVX2 assembly is the same on every CPU, and according to https://stackoverflow.com/a/64782733 there are real differences between CPUs.
So if you want the fastest code everywhere, chances are you’ll need more than one AVX2 assembly version of it.
I suspect that it is not worth using AVX2 vector gathers on any CPU. But certainly you could end up with the best implementation varying between microarchitectures for other reasons.
If you really care about performance, though, you'd want to be a lot more specific than this. I've seen image processing code that not only avoids specific instructions on some CPU families (for example, it avoids the vpermd instruction on Zen1/2/3 CPUs because of excessive latency), but also queries the CPU cache topology at runtime and uses buffer allocation strategies that ensure it can work in data batches that fit in cache.
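To make the cache-topology part concrete, here is a rough sketch (not the code I saw, just an illustration) of reading cache sizes on Linux and sizing a batch of image rows to fit. It assumes the Linux sysfs layout under /sys/devices/system/cpu/cpu0/cache; the function names and the "use half the cache" fraction are made up for the example.

```python
from pathlib import Path

# Assumption: Linux sysfs layout; other OSes need a different query.
CACHE_DIR = "/sys/devices/system/cpu/cpu0/cache"


def parse_cache_size(text):
    """Convert a sysfs size string such as '512K' or '32M' to bytes."""
    text = text.strip().upper()
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def read_cache_sizes(cache_dir=CACHE_DIR):
    """Return {level: bytes} for data/unified caches, or {} if unavailable."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}
    sizes = {}
    for index in sorted(cache_dir.glob("index*")):
        try:
            kind = (index / "type").read_text().strip()
            level = int((index / "level").read_text())
            size = parse_cache_size((index / "size").read_text())
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        if kind != "Instruction":
            sizes[level] = size
    return sizes


def rows_per_batch(cache_bytes, row_bytes, fraction=0.5):
    """Number of image rows that fit in `fraction` of the cache (at least 1)."""
    budget = int(cache_bytes * fraction)
    return max(1, budget // row_bytes)


if __name__ == "__main__":
    sizes = read_cache_sizes()
    l2 = sizes.get(2, 256 * 1024)  # hypothetical fallback when L2 is unknown
    print("rows per batch:", rows_per_batch(l2, row_bytes=4096))
```

The fraction leaves headroom for output buffers and whatever else shares the cache; the right value is something you'd measure rather than guess.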
hmmm... that's not exactly true. Hypervisors may not expose every instruction set extension to their VMs, and some hosts mask more than others. So, yeah, I agree with you on the desktop/laptop market, but be wary if your target is servers.
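Which is why you check at runtime what the machine (or VM) actually reports instead of assuming it from the CPU model. A minimal sketch, assuming Linux on x86 where /proc/cpuinfo has a "flags" line; the kernel names returned by pick_kernel are hypothetical labels, not real functions:

```python
def cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the feature flags reported for the first CPU, or an empty set."""
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                # x86 Linux lists features on a line like "flags : fpu sse2 avx2 ..."
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except (OSError, IndexError):
        pass
    return set()


def pick_kernel(flags):
    """Choose the widest code path that the reported flags allow."""
    if "avx2" in flags:
        return "avx2"
    if "sse2" in flags:
        return "sse2"
    return "scalar"


if __name__ == "__main__":
    # Inside a VM this reflects what the hypervisor exposes, not the host CPU.
    print("selected kernel:", pick_kernel(cpu_flags()))
```

In native code the same check is a CPUID query, but the shape is the same: detect once at startup, then dispatch, and always keep a fallback path.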