variadix 1 year ago

C compilers are still pretty bad at auto-vectorization. For problems where SIMD is applicable, you can reasonably expect a 2x-16x speedup over the naive scalar implementation.
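As a rough illustration, here is one common case that compilers leave scalar: a float reduction. This is a sketch, not from the thread. The function names are hypothetical, and the SIMD path assumes x86-64 with SSE2, falling back to plain scalar code otherwise. Under strict IEEE semantics the additions must happen in source order, so GCC and Clang typically won't vectorize `sum_scalar` unless reassociation is allowed (e.g. `-ffast-math`). Writing the loop with intrinsics makes the reordering explicit.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Naive scalar sum: strict FP semantics fix the order of the additions,
   so compilers typically won't vectorize this without -ffast-math. */
float sum_scalar(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Hand-vectorized sum (hypothetical name): four lanes accumulate
   independently and are combined at the end, which reorders the
   additions, so results may differ from sum_scalar in the last bits. */
float sum_sse(const float *x, size_t n) {
    size_t i = 0;
    float s = 0.0f;
#if defined(__SSE2__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    s = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    for (; i < n; i++)   /* leftover elements (or all of them without SSE2) */
        s += x[i];
    return s;
}
```

The reordering is exactly the trade the compiler refuses to make on its own. Whether it is acceptable depends on how much the last bits of the result matter to you.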

astrange 1 year ago

Also, if you write code with intrinsics, auto-vectorization can make it _worse_. E.g. one pattern is to write a SIMD main loop followed by a scalar tail, but the compiler may auto-vectorize the tail and mess it up.
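A minimal sketch of the pattern described, using a hypothetical `saxpy_sse` (`y[i] = a * x[i] + y[i]`). It assumes x86-64 with SSE2; without SSE2 the scalar loop simply does all the work. The intrinsic loop handles four floats per iteration and the scalar loop finishes the remaining 0 to 3 elements.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* y[i] = a * x[i] + y[i]: SIMD main loop, then a scalar tail. */
void saxpy_sse(float a, const float *x, float *y, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    const __m128 va = _mm_set1_ps(a);
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
#endif
    /* Scalar tail: at most 3 iterations when the SSE2 loop ran. The
       compiler may not prove that bound, so it may still vectorize or
       unroll this loop, adding code and branches that rarely execute. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

If the vectorized tail does turn out to hurt, Clang accepts `#pragma clang loop vectorize(disable)` directly before the tail loop. Other compilers have their own equivalents, so check their documentation.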

janwas 1 year ago

Given the wider availability of masking (AVX-512, RISC-V and SVE), I figure scalar tails are no longer the preferred pattern everywhere.
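A minimal sketch of the masked alternative on AVX-512. It assumes x86-64 with GCC or Clang (`__attribute__((target("avx512f")))` and `__builtin_cpu_supports`), and the function names are hypothetical. The AVX-512 path is used only when the CPU reports `avx512f` at runtime; otherwise a plain scalar loop runs. There is no separate tail: the last iteration just enables fewer lanes.

```c
#include <stddef.h>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_PATH 1

/* One masked loop, no scalar tail. Masked-off lanes are not read
   (faults there are suppressed) and not written. */
__attribute__((target("avx512f")))
static void saxpy_avx512_masked(float a, const float *x, float *y, size_t n) {
    const __m512 va = _mm512_set1_ps(a);
    for (size_t i = 0; i < n; i += 16) {
        size_t rem = n - i;
        /* All 16 lanes, or only the low `rem` lanes on the last pass. */
        __mmask16 k = rem >= 16 ? (__mmask16)0xFFFF
                                : (__mmask16)((1u << rem) - 1u);
        __m512 vx = _mm512_maskz_loadu_ps(k, x + i);
        __m512 vy = _mm512_maskz_loadu_ps(k, y + i);
        _mm512_mask_storeu_ps(y + i, k,
                              _mm512_add_ps(_mm512_mul_ps(va, vx), vy));
    }
}
#endif

/* y[i] = a * x[i] + y[i], dispatching at runtime. */
void saxpy_masked(float a, const float *x, float *y, size_t n) {
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        saxpy_avx512_masked(a, x, y, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)   /* scalar fallback */
        y[i] = a * x[i] + y[i];
}
```

SVE (predicates built with `whilelt`) and the RISC-V vector extension (`vsetvl` choosing the vector length each iteration) allow the same loop shape without a separate tail.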