Comment by umanwizard
1 year ago
SIMD doesn’t operate on a separate memory space or anything like that. You just load data from normal memory into the SIMD registers, just like you would have to load it into the scalar registers if you wanted to operate on it with normal instructions.
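To make that concrete, here is a minimal sketch in Rust, assuming an x86_64 target (where SSE is part of the baseline) with a plain scalar fallback elsewhere. The function name `add4` is made up for illustration. The inputs are ordinary arrays in ordinary memory; the only SIMD-specific steps are the loads into vector registers and the store back.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Adds two 4-element f32 arrays that live in ordinary memory.
/// `add4` is a hypothetical name used only for this example.
fn add4(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];

    #[cfg(target_arch = "x86_64")]
    unsafe {
        // Unaligned loads: normal memory -> SIMD registers.
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        // One vector add, then store the register back to normal memory.
        _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb));
    }

    // Scalar fallback for other targets (assumption: no intrinsics there).
    #[cfg(not(target_arch = "x86_64"))]
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }

    out
}
```

Calling `add4(&[1.0, 2.0, 3.0, 4.0], &[10.0, 20.0, 30.0, 40.0])` gives `[11.0, 22.0, 33.0, 44.0]` on either path.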
On some targets you need to overalign data for vectorization.
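As a rough illustration of over-aligning, assuming x86_64 and its aligned SSE load/store instructions, which require 16-byte-aligned addresses: the wrapper type `Aligned16` and the function `double_in_place` are hypothetical names, not from any library.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical wrapper that forces 16-byte alignment,
/// instead of the 4-byte natural alignment of f32.
#[repr(C, align(16))]
struct Aligned16([f32; 4]);

/// Doubles every element in place.
fn double_in_place(v: &mut Aligned16) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        let p = v.0.as_mut_ptr();
        // Aligned load/store: only valid because of align(16) above.
        let x = _mm_load_ps(p);
        _mm_store_ps(p, _mm_add_ps(x, x));
    }

    // Scalar fallback for other targets.
    #[cfg(not(target_arch = "x86_64"))]
    for e in v.0.iter_mut() {
        *e *= 2.0;
    }
}
```

With a plain `[f32; 4]` the compiler is free to place the array at any 4-byte boundary, so the aligned intrinsics would not be safe to use; the unaligned variants (`_mm_loadu_ps`, `_mm_storeu_ps`) are the alternative when alignment is not guaranteed.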
Moving data from SIMD registers to scalar registers can be slow.
It depends. For SIMD float -> scalar float it is fast, since they operate on the same registers. If you are pulling out of lane 0 you don't even need to do anything (just a type cast). For other lanes you need a shuffle.
For SIMD integer to scalar integer, the value has to move into a separate register, so there is a small penalty (3 cycles, IIRC).
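A small sketch of the float case, assuming x86_64 SSE, where scalar floats live in the same XMM registers as the vectors. The function name `lanes_0_and_2` is made up for the example, and what the compiler actually emits depends on optimization settings.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Returns (lane 0, lane 2) of a 4-lane float vector.
#[cfg(target_arch = "x86_64")]
fn lanes_0_and_2(a: [f32; 4]) -> (f32, f32) {
    unsafe {
        let v = _mm_loadu_ps(a.as_ptr());
        // Lane 0: reinterpret the low 32 bits as a scalar float.
        let lane0 = _mm_cvtss_f32(v);
        // Lane 2: shuffle it down to lane 0 first (mask selects lane 2
        // for every output position), then read lane 0 as before.
        let lane2 = _mm_cvtss_f32(_mm_shuffle_ps::<0b10_10_10_10>(v, v));
        (lane0, lane2)
    }
}

/// Scalar fallback for other targets.
#[cfg(not(target_arch = "x86_64"))]
fn lanes_0_and_2(a: [f32; 4]) -> (f32, f32) {
    (a[0], a[2])
}
```

For the input `[1.0, 2.0, 3.0, 4.0]` this returns `(1.0, 3.0)`.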
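The integer case looks almost the same in source, again assuming x86_64 and a hypothetical function name `int_lane0`. The difference is underneath: the result has to cross from a vector register into a general-purpose register, which is where the extra latency comes from. The exact cost varies by microarchitecture, so treat any cycle count as approximate.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Returns lane 0 of a 4-lane i32 vector.
#[cfg(target_arch = "x86_64")]
fn int_lane0(a: [i32; 4]) -> i32 {
    unsafe {
        let v = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        // Copies the low 32 bits from the vector register
        // into a general-purpose register.
        _mm_cvtsi128_si32(v)
    }
}

/// Scalar fallback for other targets.
#[cfg(not(target_arch = "x86_64"))]
fn int_lane0(a: [i32; 4]) -> i32 {
    a[0]
}
```

For the input `[7, 8, 9, 10]` this returns `7`.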