Comment by btdmaster

20 days ago

In my experience C++ abstractions give the optimizer a harder job and thus it generates worse code. In this case, different code is emitted by clang if you write a C version[0] versus C++ original[1].

Usually abstraction like this means that the compiler has to emit generic code which is then harder to flow through constraints and emit the same final assembly since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case) or std::vector methods or something else like that.

[0] https://godbolt.org/z/vso7xbh61

[1] https://godbolt.org/z/MjcEKd9Tr

To back up the other commenter - it's not the same. https://godbolt.org/z/r6e443x1c shows that if you write imperfect C++ clang is perfectly capable of optimizing it.

  • What's strange is I'm finding that gcc really struggles to correctly optimize this.

    This was my function

        for (auto v : array) {
            if (v != 0)
                return false;
        }
        return true;
    

    clang emits basically the same thing yours does. But gcc ends up just really struggling to vectorize for large numbers of array.

    Here's gcc for 42 elements:

    https://godbolt.org/z/sjz7xd8Gs

    and here's clang for 42 elements:

    https://godbolt.org/z/frvbhrnEK

    Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this while GCC really struggles to want to use it. I've even seen strange output where GCC will emit SIMD instructions for the first loop and then falls back on regular x86 compares for the rest.

    Edit: Actually, it looks like for large enough array sizes, it flips. At 256 elements, gcc ends up emitting simd instructions while clang does pure x86. So strange.

    • Writing a micro benchmark is an academic exercise. You end up benchmarking in isolation which only tells you is your function faster in that exact scenario. Something which is faster in isolation in a microbenchmark can be slower when put in a real workload because vextoising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of ways to compare.

      1 reply →

Except that the C++ version doesn't need to be like that.

Abstractions are welcome when it doesn't matter, when it matters there are other ways to write the code and it keeps being C++ compliant.