Comment by chillitom

10 hours ago

Initial example takes array pointers without the __restrict__ keyword/extension so compiler might assume they could be aliased to same address space and will code defensively.

Would be interesting to see if auto vec performs better with that addition.

Also trying to let the compilers know that the float* are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p)

  • > let the compilers know that the float* are aligned

    Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.

    I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.

    Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?

    In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...

    Manually aligned the buffers and voila, the crashes stopped. Good times.

  • which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.

    No reason for the compiler to balk at vectorizing unaligned data these days.

    • > There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput

      Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...

      1 reply →