Comment by ack_complete

1 year ago

As a counterpoint, I regularly run into trivial cases that compilers are not able to autovectorize well:

https://gcc.godbolt.org/z/rjEqzf1hh

This is an unsigned byte saturating add. It is directly supported as a single instruction on both x86-64 and ARM64, as PADDUSB and UQADD.16B respectively. But all compilers make a mess of it from a straightforward description, either failing to vectorize it or generating vectorized code that is much larger and slower than necessary.
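For readers who don't want to open the link, here is a minimal sketch of what a "straightforward description" of this operation might look like in C. It is an assumption about the general shape, not a copy of the linked code; the function name `add_sat_u8` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: per-byte unsigned saturating add,
   dst[i] = min(a[i] + b[i], 255). */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Widen first so the sum cannot wrap around at 8 bits. */
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        /* Clamp to the largest value a byte can hold. */
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Each loop iteration is exactly what one lane of PADDUSB or UQADD computes, so ideally the loop body would collapse to a single vector instruction per 16 bytes.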

This is with a basic, simple vectorization primitive. It's difficult to impossible to get compilers to use some of the more complex ones, like a rounded narrowing saturated right shift (UQRSHRN).
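To make the UQRSHRN case concrete, here is a scalar sketch of what one lane of the 16-bit to 8-bit form computes, as I understand the instruction: add half of the divisor to round, shift right, then saturate to the narrower type. The function name and signature are hypothetical, and the shift is assumed to be in the range 1 to 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for a rounded, narrowing, saturating
   right shift from 16-bit to 8-bit unsigned lanes.
   Assumes 1 <= shift <= 8. */
void rshrn_round_sat_u16_to_u8(uint8_t *dst, const uint16_t *src,
                               size_t n, unsigned shift)
{
    for (size_t i = 0; i < n; ++i) {
        /* Add the rounding constant in 32 bits so it cannot overflow. */
        uint32_t rounded = ((uint32_t)src[i] + (1u << (shift - 1))) >> shift;
        /* Saturate to the 8-bit range before narrowing. */
        dst[i] = (uint8_t)(rounded > 255u ? 255u : rounded);
    }
}
```

Three separate steps (round, shift, clamp) have to be recognized together as one idiom, which is a plausible reason compilers have a harder time with it than with the plain saturating add.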

Oh, I agree it is not foolproof; in fact, I never understood why saturated math isn't 'standard' somewhere, even as an operator. Given that we have a 'normalisation' operator, there's always a way to find a natural-looking syntax of sorts.

But again, if you don't like the generated code, you can take the generated code and tweak it, and use that; I did it quite a few times.