Comment by astrange

1 year ago

It is slow to move data from SIMD to scalar registers, or can be.

It depends. For SIMD float -> scalar float it is fast, since they operate on the same registers. If you're pulling out of lane 0 you don't even need to do anything (just a type cast). For other lanes you need a shuffle.
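
To make the float case concrete, here is a rough sketch. It assumes x86 with SSE, which the comment doesn't actually specify, and the function names `lane0` and `lane2` are made up for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Lane 0: the scalar float already lives in the low 32 bits of the
   XMM register, so this is effectively just a type cast. */
static inline float lane0(__m128 v) {
    return _mm_cvtss_f32(v);
}

/* Lane 2: shuffle the wanted element down into lane 0 first,
   then read it out the same way. */
static inline float lane2(__m128 v) {
    __m128 s = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_cvtss_f32(s);
}
```

With `v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)`, `lane0(v)` gives 1.0f and `lane2(v)` gives 3.0f. A compiler will typically emit nothing extra for `lane0` and a single `shufps` for `lane2`, though that depends on the compiler and flags.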

For SIMD integer -> scalar integer, the value has to move into a separate (general-purpose) register, so there is a short penalty (3 cycles, IIRC).
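
The integer case looks almost the same in source, but the last step crosses register files. Again a sketch only, assuming x86 with SSE2; `int_lane0` and `int_lane2` are hypothetical names.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

/* Lane 0: the value must be copied from the XMM register into a
   general-purpose register, so a real move instruction is needed. */
static inline int int_lane0(__m128i v) {
    return _mm_cvtsi128_si32(v);
}

/* Lane 2: shuffle the wanted element into lane 0, then do the same
   cross-register move. */
static inline int int_lane2(__m128i v) {
    __m128i s = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_cvtsi128_si32(s);
}
```

With `v = _mm_setr_epi32(10, 20, 30, 40)`, `int_lane0(v)` gives 10 and `int_lane2(v)` gives 30. Unlike the float version, `_mm_cvtsi128_si32` normally turns into an actual `movd` from the vector register to an integer register, which is where the extra latency comes from. The exact cost varies by microarchitecture.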