Comment by menaerus
4 hours ago
I am trying to understand why "zeroing got cheaper" circa 2012-2014. Do you have some plausible explanations you can share?
Haswell (2013) doubled store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled load throughput to the same. But the datasets being operated on at FB are most likely much larger than what L1+L2+L3 can hold, so I wonder how much effect the vectorization engine could have had: bulk-zeroing a large dataset is going to be bottlenecked by single-core memory bandwidth, which at the time was ~20 GB/s.
Perhaps the operation became cheaper simply because of moving to another CPU uarch with a higher clock and more memory bandwidth, rather than because of vectorization.
My memory is that Ivy Bridge was when it started being different.
AVX maybe?