Comment by shihab

1 month ago

Hi, thanks for reading.

Re (b): I'm curious what that middle ground is. Is there a simple refactor that would help GCC get rid of this `if`? (Note: ISPC did fine here.)

(c) Just to be clear, all the code in the benchmark figures (baseline and SIMD) was compiled with fast-math flags.

Regarding (a), one of the points I wanted to get across was that, in the end, it didn't feel as complicated to program as I had expected. Porting to AVX-512 felt mechanical (hence the success of LLMs in one-shotting the whole thing).

This is a subjective opinion and depends on the programmer's experience, etc., so I won't dwell on it. I just wish more CPU programmers gave it a try.

For what it's worth, I had the exact same experience you did when I started writing SIMD code explicitly with intrinsics.

I avoided it for a long time because, well, it was so damn ugly and verbose to do simple things. However, in actual practice it's not nearly as painful as it looks, and you get used to it quickly.

The typical way would be to unroll the inner loop manually; often you can get away with:

    for (int i = 0; i < N; i += SIMD_WIDTH) {
        for (int j = 0; j < SIMD_WIDTH; j++) {
            // per-element work on index i + j
        }
    }

but if the compiler fails to vectorise that, you can do it more like:

    for (int i = 0; i < N; i += SIMD_WIDTH) {
        float mask[SIMD_WIDTH];
        // do work into mask, find max of the mask
    }

That's effectively what you're doing anyway in the SIMD code, but it keeps it more readable for mere mortals, and because you can define SIMD_WIDTH as a constant, it's also slightly easier to change if a new instruction set comes along; you're not maintaining multiple kernels.
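
To make that second pattern concrete, here is a minimal sketch in plain C. The kernel itself is made up for illustration (it is not the code from the article): `first_block_exceeding`, `limit`, and the choice of `SIMD_WIDTH = 8` are all assumptions, and it assumes `n` is a multiple of `SIMD_WIDTH` so there is no tail to handle. The point is only the shape: fill a small per-lane `mask` array without branching, reduce it with a max, and branch once per block rather than once per element.

```c
#define SIMD_WIDTH 8  /* assumption: 8 floats per vector; change for other targets */

/* Hypothetical kernel: returns the index of the first block of SIMD_WIDTH
 * elements that contains a value greater than `limit`, or -1 if none does.
 * Assumes n is a multiple of SIMD_WIDTH. */
static int first_block_exceeding(const float *x, int n, float limit)
{
    for (int i = 0; i < n; i += SIMD_WIDTH) {
        float mask[SIMD_WIDTH];

        /* Per-lane work: a select with no control flow, one lane per element. */
        for (int j = 0; j < SIMD_WIDTH; j++) {
            mask[j] = (x[i + j] > limit) ? 1.0f : 0.0f;
        }

        /* Horizontal reduction: max over the lanes of the mask. */
        float any = 0.0f;
        for (int j = 0; j < SIMD_WIDTH; j++) {
            any = (mask[j] > any) ? mask[j] : any;
        }

        /* A single branch per block instead of one per element. */
        if (any > 0.0f) {
            return i / SIMD_WIDTH;
        }
    }
    return -1;
}
```

Whether a given compiler actually turns the two inner loops into vector instructions still depends on the compiler version and flags, so it is worth checking the generated assembly or the vectorisation report rather than assuming it.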