Comment by ack_complete

8 hours ago

The REP MOVS family of instructions has an interesting history, owing to the advantages and disadvantages of microcode and how its performance relative to hand-written code has shifted with each CPU generation. It has long been great for large aligned copies because the microcode has access to cache-line-wide copies, but until recently it struggled with small copies. Apparently, one of the reasons is a lack of branch prediction in microcode:

https://stackoverflow.com/questions/33902068/what-setup-does...
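
For anyone who hasn't used the instruction directly, here is a minimal sketch of a REP MOVSB copy in C. It assumes GCC or Clang inline assembly on x86 and falls back to memcpy elsewhere. The name copy_rep_movsb is made up for illustration, and this is not meant to reflect how any particular libc implements memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes from src to dst with REP MOVSB.
 * Assumes the direction flag is clear, which the x86 ABIs guarantee
 * on function entry. Regions must not overlap. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert((dst != NULL && src != NULL) || n == 0);
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* RDI = destination, RSI = source, RCX = byte count.
     * All three registers are modified by the instruction. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Not x86 or not a GNU-style compiler: use the library copy. */
    memcpy(dst, src, n);
#endif
    return ret;
}
```

Whether this beats a SIMD loop depends heavily on the copy size and the CPU generation, so it is worth measuring rather than assuming.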

Non-temporal stores are tricky performance-wise. They can be dramatically faster than normal stores (~3x), but the gain varies between CPU generations, they can be slower if subsequent code needs the destination in the CPU cache, and even when the data is destined for a GPU they may not be ideal if an iGPU shares part of the cache hierarchy with the CPU. The worst issue, though, is that occasionally a specific CPU will have some random pathological behavior with them. IIRC, masked non-temporal stores were horrifically slow on some AMD APUs, on the order of hundreds to thousands of cycles per instruction. I find it hard to recommend them much anymore.
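
To make it concrete, this is roughly what a copy with non-temporal stores looks like using the SSE2 intrinsics. It is only a sketch under stated assumptions: copy_streaming is a made-up name, it uses plain (unmasked) 16-byte streaming stores, and it falls back to memcpy when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy n bytes, writing the bulk of the
 * destination with non-temporal (streaming) stores that bypass
 * the cache. Regions must not overlap. */
static void copy_streaming(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* _mm_stream_si128 needs a 16-byte-aligned destination, so
     * copy the unaligned head with ordinary stores first. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    for (; n >= 16; d += 16, s += 16, n -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Streaming stores are weakly ordered: fence before any later
     * store that publishes the buffer to another thread. */
    _mm_sfence();

    memcpy(d, s, n); /* tail of fewer than 16 bytes */
#else
    memcpy(dst, src, n);
#endif
}
```

If the destination is read again soon after the copy, the ordinary memcpy path is likely the better choice, since the streamed data will not be in cache.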