Comment by Remnant44
11 hours ago
which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.
No reason for the compiler to balk at vectorizing unaligned data these days.
> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
There are many situations where your data is essentially _majority_ unaligned. Considerable effort by the hardware guys has gone into making that situation work well.
A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +- N element neighborhood around a pixel. By definition most of those reads will be unaligned!
A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth over compute, I have found it to be worth doing a head/body/tail situation where you do one misaligned iteration before doing the bulk of the work in alignment, but honestly for that to be worth it you have to be working almost completely out of L1 cache which is rare... otherwise you're going to be slowed down to L2 or memory speed anyways, at which point the half rate penalty doesn't really matter.
The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.