Comment by physicsguy
1 month ago
The typical way would be to unroll the inner loop manually; often you can get away with:
for (int i = 0; i < N; i += SIMD_WIDTH) {
    for (int j = 0; j < SIMD_WIDTH; j++) {
        // do work on element i + j
    }
}
but if the compiler fails to vectorise that, you can do it more like:
for (int i = 0; i < N; i += SIMD_WIDTH) {
    float mask[SIMD_WIDTH];
    // do work into mask, find max of the mask
}
That's effectively what you're doing anyway in the SIMD code, but it keeps it more readable for mere mortals, and because you can define SIMD_WIDTH as a constant, it's also slightly easier to change if a new instruction set comes along; you're not maintaining multiple kernels.
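To make that concrete, here's a rough sketch of the second pattern in plain C. The kernel itself (largest absolute difference between two float arrays), the name max_abs_diff, the lane buffer and SIMD_WIDTH = 8 are all made up for illustration, not taken from the original code. Whether the inner loops actually get vectorised depends on your compiler and flags, so check the generated assembly rather than trusting this.

```c
#include <math.h>
#include <stddef.h>

#define SIMD_WIDTH 8 /* assumed lane count, e.g. 8 floats per AVX register */

/* Hypothetical kernel: largest |a[i] - b[i]| over n elements. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float best = 0.0f;
    size_t i = 0;

    /* Full blocks: the inner loops have a fixed, compile-time trip count. */
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH) {
        float lane[SIMD_WIDTH];

        /* Do the work into a small fixed-size buffer. */
        for (int j = 0; j < SIMD_WIDTH; j++)
            lane[j] = fabsf(a[i + j] - b[i + j]);

        /* Reduce the buffer to a single maximum. */
        for (int j = 0; j < SIMD_WIDTH; j++)
            if (lane[j] > best)
                best = lane[j];
    }

    /* Scalar tail for when n is not a multiple of SIMD_WIDTH. */
    for (; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > best)
            best = d;
    }
    return best;
}
```

The tail loop is the one thing the snippets above gloss over: they assume N is a multiple of SIMD_WIDTH.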