Comment by physicsguy

1 month ago

The typical way would be to unroll the inner loop manually; often you can get away with:

    for (int i = 0; i < N; i += SIMD_WIDTH) {
        for (int j = 0; j < SIMD_WIDTH; j++) {
            // do work on element i + j
        }
    }

but if the compiler fails to vectorise that, you can do it more like:

    for (int i = 0; i < N; i += SIMD_WIDTH) {
        float mask[SIMD_WIDTH];
        // do work into mask, then find the max of the mask
    }

That's effectively what you're doing anyway in the SIMD code, but it keeps it more readable for mere mortals, and because you can define SIMD_WIDTH as a constant, it's also slightly easier to change if a new instruction set comes along; you're not maintaining multiple kernels.
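
To make that concrete, here is a rough sketch of the blocked pattern in plain C. The kernel itself (largest squared value in an array), the name `max_square`, the `lane` buffer, and `SIMD_WIDTH = 8` are all made up for illustration, not taken from the original code; whether the compiler actually vectorises the inner loops depends on your compiler and flags.

```c
#include <stddef.h>

#define SIMD_WIDTH 8  /* assumed lane count, e.g. 8 floats in a 256-bit register */

/* Hypothetical kernel: largest squared value in x[0..n). */
float max_square(const float *x, size_t n)
{
    float best = 0.0f;  /* squares are non-negative, so 0 is a safe start */
    size_t i = 0;

    /* Full blocks: the inner loops have a fixed, known trip count. */
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH) {
        float lane[SIMD_WIDTH];

        /* Do the work into the small per-block buffer. */
        for (int j = 0; j < SIMD_WIDTH; j++) {
            lane[j] = x[i + j] * x[i + j];
        }

        /* Reduce the buffer into the running maximum. */
        for (int j = 0; j < SIMD_WIDTH; j++) {
            if (lane[j] > best) {
                best = lane[j];
            }
        }
    }

    /* Scalar tail for leftovers when n is not a multiple of SIMD_WIDTH. */
    for (; i < n; i++) {
        float v = x[i] * x[i];
        if (v > best) {
            best = v;
        }
    }
    return best;
}
```

Changing `SIMD_WIDTH` is then the only edit needed to retarget a wider or narrower instruction set, and the scalar tail means N doesn't have to be a multiple of it.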