Comment by maccard
15 days ago
Writing a microbenchmark is an academic exercise. You end up benchmarking in isolation, which only tells you whether your function is faster in that exact scenario. Something that is faster in isolation in a microbenchmark can be slower when put in a real workload, because vectorising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of ways to compare.
This isn't a microbenchmark. In fact, I haven't even bothered to benchmark it (perhaps the non-SIMD version actually is faster?).
This is purely me looking at the emitted assembly and being surprised at when the compilers decide to deploy SIMD and when they don't. It may be the case that the SIMD instructions are in fact slower, even though in theory they should be faster.
Both compilers are simply using heuristics to determine when it's fruitful to deploy SIMD instructions.
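For anyone who wants to poke at this themselves, here is a minimal sketch of the kind of comparison being described. It is not the code from the original post: the function names `sum_u32` and `find_first` are made up for illustration, and whether either loop actually gets vectorised depends on the compiler, its version, the target, and the flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: a plain reduction. Unsigned integer addition can be
 * reordered freely, so this loop is a common candidate for SIMD at -O2/-O3. */
uint32_t sum_u32(const uint32_t *data, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Hypothetical example: a search with an early exit. The data-dependent
 * break makes this harder to vectorise, and compilers often leave it scalar.
 * Returns the index of the first match, or n if there is none. */
size_t find_first(const uint32_t *data, size_t n, uint32_t needle) {
    for (size_t i = 0; i < n; i++) {
        if (data[i] == needle) {
            return i;
        }
    }
    return n;
}
```

To see what each compiler decided, you can read the assembly from `gcc -O3 -S` or `clang -O3 -S`, or ask for a vectorisation report with `gcc -O3 -fopt-info-vec -fopt-info-vec-missed` or `clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize`. None of that says which version is faster; it only shows which way the heuristics went.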