Comment by chillitom

21 days ago

Initial example takes array pointers without the __restrict__ keyword/extension so compiler might assume they could be aliased to same address space and will code defensively.

Would be interesting to see if auto vec performs better with that addition.

9 comments

chillitom

chillitom 21 days ago

Also trying to let the compilers know that the float* are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p)

magicalhippo 21 days ago

> let the compilers know that the float* are aligned
Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.
I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.
Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?
In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...
Manually aligned the buffers and voila, the crashes stopped. Good times.
Remnant44 21 days ago
which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.
No reason for the compiler to balk at vectorizing unaligned data these days.
- dmpk2k 21 days ago
  
  > There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput
  Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
  
  1 reply →
- Sesse__ 21 days ago
  
  Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.
  Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!
  
  3 replies →