Comment by chillitom

20 days ago

Also trying to let the compilers know that the float* are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p)

8 comments

chillitom

> let the compilers know that the float* are aligned

Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.

I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.

Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?

In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...

Manually aligned the buffers and voila, the crashes stopped. Good times.

Remnant44 20 days ago

which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.

No reason for the compiler to balk at vectorizing unaligned data these days.

dmpk2k 20 days ago
> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
- Remnant44 20 days ago
  
  There are many situations where your data is essentially _majority_ unaligned. Considerable effort by the hardware guys has gone into making that situation work well.
  A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +- N element neighborhood around a pixel. By definition most of those reads will be unaligned!
  A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth over compute, I have found it to be worth doing a head/body/tail situation where you do one misaligned iteration before doing the bulk of the work in alignment, but honestly for that to be worth it you have to be working almost completely out of L1 cache which is rare... otherwise you're going to be slowed down to L2 or memory speed anyways, at which point the half rate penalty doesn't really matter.
  The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.
Sesse__ 20 days ago
Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.
Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!
- jandrewrogers 20 days ago
  
  With the older microarchitectures there was a large penalty for crossing a cache line with AVX-512. In some cases, the performance could be worse than AVX2!
  In older microarchitectures like Ice Lake it was pretty bad, so you wanted to avoid unaligned loads if you could. This penalty has rapidly shrunk across subsequent generations of microarchitectures. The penalty is still there but on recent microarchitectures it is small enough that the unaligned case often isn't a showstopper.
  The main reason to use aligned loads in code is to denote cases where you expect the address to always be aligned i.e. it should blow up if it isn't. Forcing alignment still makes sense if you want predictable, maximum performance but it isn't strictly necessary for good performance on recent hardware in the way it used to be.
- Remnant44 20 days ago
  
  AVX doesn't require alignment of any memory operands, with the exception of the specific load aligned instruction. So you/the compiler are free to use the reg,mem form interchangibly with unaligned data.
  The penalty on modern machines is an extra cycle of latency and, when crossing a cacheline, half the throughput (AVX512 always crosses a cacheline since they are cacheline sized!). These are pretty mild penalties given what you gain! So while it's true that peak L1 cache performance is gained when everything is aligned.. the blocker is elsewhere for most real code.
  
  1 reply →