
Comment by mort96

7 hours ago

Regarding (b), I would never rely on auto-vectorization because I have no insight into it. The only way to check whether my code is written such that auto-vectorization can kick in is to compile my program with every combination of compiler, optimization setting, and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out whether it auto-vectorized in the way I expect. That's a terrible developer experience; writing intrinsics by hand is much easier and more robust. I'd need to re-check every piece of auto-vectorized code after every change and after every compiler upgrade.

I just treat auto-vectorization like I treat every other fancy optimization beyond constant folding and inlining: nice when it happens to work out, and it probably makes my code a couple percent faster on average, but I absolutely can't rely on it anywhere I depend on the performance.

> every combination of compiler, optimization setting and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out if it did autovectorization in the way I expect

I just used to fire up VTune and inspect the hot loops... typically, if you care about this, you're really only working on hardware targeting the latest instruction sets anyway, in my experience. It's only when working on low-level libraries that I'd bother with intrinsics all over the place.

For most consumer software you want to be able to fall back to some lowest-common-denominator hardware anyway, otherwise people using it run into issues - the same reason that Debian, Conda, etc. only go up to really old instruction sets.

  • I work on games sometimes, where the goal is: "run as fast as possible on everyone's computer, whether it's a 15-year-old netbook with an Intel Atom or a freshly built beast of a gaming desktop". As a result, the best approach is to discover supported instructions at runtime and dispatch to a function that uses those instructions (maybe populating a global vector function table at launch?). Next best is to assume some base level of vector support (maybe the original AVX for x86, Neon for ARM) and unconditionally use it. Targeting only the latest instruction sets is a complete non-starter.