Comment by spacecadet_
8 days ago
> Very few instructions even allowed interaction between the top and bottom 128 bits
That would be plain AVX; AVX2 adds shuffles across the 128-bit boundary (e.g. VPERMD, VPERMQ). To me that seems like the main hurdle for emulation with 128-bit vectors. In my experience compilers are very eager to emit shuffle instructions if allowed, and emulating a cross-lane 256-bit shuffle with 128-bit operations would require 2 shuffles and a blend for each half of the emulated register.
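To make the "2 shuffles and a blend for each half" cost concrete, here is a rough scalar sketch in C. It is only a model, not real emulator code: `vec128`, `shuffle128`, `blend128` and `permute256` are hypothetical names I made up, and the semantics assumed are VPERMD-style (each 32-bit output lane picks any of the 8 input lanes using the low 3 bits of its index).

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit register: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec128;

/* Models a 128-bit variable shuffle: out[i] = src[idx[i] & 3]. */
static vec128 shuffle128(vec128 src, const uint32_t idx[4]) {
    vec128 out;
    for (int i = 0; i < 4; i++)
        out.lane[i] = src.lane[idx[i] & 3];
    return out;
}

/* Models a 128-bit blend: take lane i from b if index bit 2 is set
   (the index points into the upper source half), otherwise from a. */
static vec128 blend128(vec128 a, vec128 b, const uint32_t idx[4]) {
    vec128 out;
    for (int i = 0; i < 4; i++)
        out.lane[i] = (idx[i] & 4) ? b.lane[i] : a.lane[i];
    return out;
}

/* Emulated 256-bit cross-lane permute on a register stored as two halves:
   out[i] = src[idx[i] & 7]. Each output half costs 2 shuffles + 1 blend,
   so 6 operations in total instead of 1. */
static void permute256(const vec128 src[2], const uint32_t idx[8], vec128 out[2]) {
    for (int h = 0; h < 2; h++) {
        const uint32_t *sel = idx + 4 * h;          /* indices for this output half */
        vec128 from_lo = shuffle128(src[0], sel);   /* candidates from the low half */
        vec128 from_hi = shuffle128(src[1], sel);   /* candidates from the high half */
        out[h] = blend128(from_lo, from_hi, sel);   /* pick per lane by index bit 2 */
    }
}
```

With real 128-bit instructions the blend mask would also have to be derived from the index vector, so the actual cost is probably a bit higher than this model suggests.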
EDIT: I just noticed that the benchmark in the article is pure math, which probably wouldn't hit this particular issue, so this doesn't explain the performance difference...