Comment by AlotOfReading

4 hours ago

C is a programming language. It makes for a very shitty high-level assembler.

Here's a trivial example that clang will often implement differently on different systems, producing two different results. Clang on x86-64 will generally emit a separate mul and add, while clang on arm64 is aggressive about fusing them into an fma.

    x = 3.0f*x+1.0f;

But that's just the broad strategy. Depending on the actual compiler flags, the generated assembly might include anything up to multiple function calls under the hood (sanitizers, soft-float routines, function profiling, etc.).
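To make the "two different results" concrete, here is a minimal sketch that pins down both strategies for that line so they can be compared on one machine. The function names are made up for illustration. The volatile store is just a trick to stop the compiler from contracting the expression, and the single-rounding path is emulated in double (which is exact for the input discussed below) rather than using a real fma instruction. `fmaf` from `<math.h>` is the standard way to ask for the fused operation.

```c
#include <stdio.h>

/* Two roundings: the volatile store forces the product to be rounded to
   float before the add, so the compiler cannot fuse the two operations. */
float mul_then_add(float x) {
    volatile float product = 3.0f * x;
    return product + 1.0f;
}

/* One rounding, emulated: 3.0 * x is exact in double for any float x, so
   the only rounding that matters here is the final conversion to float.
   (Assumption: this matches a true fma for the input used below; it is
   not a general replacement for fmaf.) */
float single_rounding(float x) {
    return (float)(3.0 * (double)x + 1.0);
}

/* Print both results in hex-float form so every bit is visible. */
void show_both(float x) {
    printf("mul+add: %a\n", (double)mul_then_add(x));
    printf("fused:   %a\n", (double)single_rounding(x));
}
```

With x = -0x1.555556p-2f (the float nearest -1/3), the product is exactly -(1 + 2^-25). Rounding that to float gives -1, so the two-step version returns 0, while the single-rounding version returns -2^-25 (about -2.98e-8).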

I don't think clang is being "aggressive" on ARM, it's just that all aarch64 targets support fma. You'll get similar results with vfmadd213ss on x86-64 with -march=haswell (13 years old at this point, probably a safe bet).

    float fma(float x) {
        return 3.0f * x + 1.0f;
    }

Clang armv8 21.1.0:

    fma(float):
        sub     sp, sp, #16
        str     s0, [sp, #12]
        ldr     s1, [sp, #12]
        fmov    s2, #1.00000000
        fmov    s0, #3.00000000
        fmadd   s0, s0, s1, s2
        add     sp, sp, #16
        ret

Clang x86-64 21.1.0:

    .LCPI0_0:
        .long   0x3f800000
    .LCPI0_1:
        .long   0x40400000
    fma(float):
        push    rbp
        mov     rbp, rsp
        vmovss  dword ptr [rbp - 4], xmm0
        vmovss  xmm1, dword ptr [rbp - 4]
        vmovss  xmm2, dword ptr [rip + .LCPI0_0]
        vmovss  xmm0, dword ptr [rip + .LCPI0_1]
        vfmadd213ss     xmm0, xmm1, xmm2
        pop     rbp
        ret

  • The point is that there are multiple, meaningfully different implementations for the same line, not that either is wrong. Sometimes compilers will even produce both implementations and call one or the other based on runtime checks, as this ICC example does:

    https://godbolt.org/z/KnErdebM5
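A hand-written sketch of that kind of runtime dispatch, to show the shape of it. This is not what ICC actually emits. All names are hypothetical, the "fused" path is emulated in double instead of using a real fma instruction, and the feature flag is passed in by the caller. On x86 with GCC or clang, something like `__builtin_cpu_supports("fma")` could supply it.

```c
#include <stdio.h>

/* Rounds twice: once after the multiply, once after the add.
   The volatile store keeps the compiler from fusing the two. */
static float mul_add_separate(float x) {
    volatile float product = 3.0f * x;
    return product + 1.0f;
}

/* Rounds once; stands in for the hardware fma path. Emulated in double,
   where 3.0 * x is exact for any float x (an assumption-laden stand-in,
   not a real fma). */
static float mul_add_fused(float x) {
    return (float)(3.0 * (double)x + 1.0);
}

typedef float (*mul_add_fn)(float);

/* Pick an implementation once, based on a runtime check.
   has_fma is whatever the (hypothetical) CPU feature probe reported. */
mul_add_fn select_mul_add(int has_fma) {
    return has_fma ? mul_add_fused : mul_add_separate;
}

/* Same source-level expression, two possible answers. */
void demo(float x) {
    printf("without fma: %a\n", (double)select_mul_add(0)(x));
    printf("with fma:    %a\n", (double)select_mul_add(1)(x));
}
```

The caller sees one function pointer, and which arithmetic it gets depends on the machine it happens to run on, which is exactly why "what does this line compile to" has no single answer.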