
Comment by buserror

1 year ago

I used to write quite a few SIMD versions of critical functions, but now I rarely do -- one thing to try is to isolate that code and run it through the Most Excellent Compiler Explorer [0].

And stare at the generated code!

More often than not, auto-vectorisation now generates a pretty excellent SIMD version of your function, and all you have to do is 'hint' the compiler -- for example, explicitly state alignment, or provide your own vector source/destination types. You can do a lot by 'styling' your C code while thinking about what the compiler might be able to do with it -- for example, use extra intermediate variables, really break down all the operations you want, etc.

Worst case, if the compiler REALLY isn't clever enough, this gives you a good base: you can adapt and tweak the generated assembly without having to write the boilerplate bits yourself.

In most cases, the resulting C function will be vectorized as well as, or better than, the hand-coded one I'd write -- and in many other cases, it's "close enough" not to matter that much. The other good news is that the code will probably vectorize fine for WASM, NEON, etc. without needing explicit versions.

[0] https://godbolt.org/
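As a rough sketch of the kind of hinting described above: the function name scale_add and the 16-byte alignment are made up for illustration, and `__builtin_assume_aligned` is a GCC/Clang builtin rather than standard C. Whether this actually vectorizes depends on the compiler, version and flags, so it is worth pasting into Compiler Explorer and checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dst[i] += src[i] * k.
 * 'restrict' promises the compiler that dst and src do not overlap.
 * __builtin_assume_aligned (GCC/Clang extension) promises 16-byte alignment;
 * passing a misaligned pointer here is undefined behaviour. */
void scale_add(float *restrict dst, const float *restrict src, float k, size_t n)
{
    float *d = __builtin_assume_aligned(dst, 16);
    const float *s = __builtin_assume_aligned(src, 16);

    for (size_t i = 0; i < n; i++) {
        /* Extra intermediate variable: one simple operation per statement. */
        float scaled = s[i] * k;
        d[i] = d[i] + scaled;
    }
}
```

The caller is responsible for keeping both promises (no overlap, aligned buffers); the compiler will not check them.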

We did something slightly similar. For the very few isolated things where it makes sense (e.g. image upload/download and conversions in the GPU driver that weren't supported, or weren't large enough to be worth firing off a GPU job to complete), the routines were initially written in C and used compiler annotations to specify things like alignment or allowed pointer aliasing, in order to make the compiler generate the code we wanted. GCC and Clang both support some vector extensions that allow somewhat portable implementations of things like scatter-gather, shuffling things around, or masking elements in a single register. Those are hard to specify in "plain" C clearly enough that the code is both readable for humans and will always generate the expected code across compiler versions.

But because we needed to support other compilers and platforms, we actually ended up importing the asm generated from those source files into the actual build.
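For anyone who hasn't seen the GCC/Clang vector extensions mentioned above, here is a minimal sketch of per-lane masking within a single register. The type name v4i and the function relu4 are hypothetical, and this is not the driver code being described, just an illustration of the `vector_size` attribute.

```c
#include <stdint.h>

/* Four 32-bit lanes in one 16-byte vector (GCC/Clang extension, not ISO C). */
typedef int32_t v4i __attribute__((vector_size(16)));

/* Hypothetical example: clamp negative lanes to zero. */
static inline v4i relu4(v4i x)
{
    v4i zero = {0, 0, 0, 0};
    /* A vector comparison yields -1 (all bits set) in lanes where it is
     * true and 0 in lanes where it is false. */
    v4i mask = x > zero;
    /* AND with the mask keeps positive lanes and zeroes the rest. */
    return x & mask;
}
```

Individual lanes can be read back with ordinary subscripting, e.g. `r[1]`, which both compilers accept for these vector types.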

As a counterpoint, I regularly run into trivial cases that compilers are not able to autovectorize well:

https://gcc.godbolt.org/z/rjEqzf1hh

This is an unsigned byte saturating add. It is directly supported as a single instruction on both x86-64 and ARM64, as PADDUSB and UQADD.16B respectively. But all compilers make a mess of it from a straightforward description, either failing to vectorize it or generating vectorized code that is much larger and slower than necessary.

This is with a basic, simple vectorization primitive. It's difficult to impossible to get compilers to use some of the more complex ones, like a rounded narrowing saturated right shift (UQRSHRN).
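For reference, a "straightforward description" of an unsigned byte saturating add might look something like the sketch below. This is an assumption about the general shape, not necessarily the code behind the Compiler Explorer link above, and the name sat_add_u8 is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: dst[i] = min(a[i] + b[i], 255). */
void sat_add_u8(uint8_t *restrict dst, const uint8_t *restrict a,
                const uint8_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Widen first so the sum cannot wrap around at 8 bits. */
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        /* Clamp to the largest value a byte can hold. */
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Each iteration of the loop body corresponds to one lane of the single PADDUSB / UQADD instruction mentioned above.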

  • Oh I agree it is not foolproof. In fact, I never understood why saturated math isn't 'standard' somewhere, even as an operator. Given we have a 'normalisation' operator, there's always a way to find a natural-looking syntax of sorts.

    But again, if you don't like the generated code, you can take the generated code and tweak it, and use that; I did it quite a few times.

Problem is, you have to take care to look at the compiler output and compare it to your expectations. Maybe fiddle with it a bit until it matches what you would have written yourself. Usually, it is quicker to just write it yourself...

  • > Problem is, you have to take care to look at the compiler output and compare it to your expectations. Maybe fiddle with it a bit until it matches what you would have written yourself.

    And keep redoing that for every new compiler or version of a compiler, or if you change compile options. Any of those things can prevent the auto-vectorization.

IME, auto-vectorization is a fragile optimization that will silently fail under all sorts of conditions. I don't like to rely on it.

  • You can just store the generated binary / assembly and rely on that if you want stable code.

I have no idea how to get the compiler to generate a wider-than-16 pshufb in the general case, for example. And for the 16-wide case, writing the actual definition of pshufb prevents you from getting pshufb, while writing a version with UB gets you pshufb.
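To make "the actual definition of pshufb" concrete, here is a scalar sketch of the 16-byte case as I understand the instruction's semantics. The function name pshufb_ref is hypothetical, and dst is assumed not to overlap src or idx.

```c
#include <stdint.h>

/* Hypothetical scalar reference for a 16-byte PSHUFB.
 * For each output byte: if bit 7 of the index byte is set, the result
 * is 0; otherwise the low 4 bits of the index select a source byte. */
void pshufb_ref(uint8_t dst[16], const uint8_t src[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t sel = idx[i];
        dst[i] = (sel & 0x80) ? 0 : src[sel & 0x0F];
    }
}
```

My guess (an assumption, the comment doesn't spell it out) is that the "version with UB" is something like indexing `src[sel]` directly with no mask and no zeroing branch, which reads out of bounds for any index above 15.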