Comment by menaerus

2 hours ago

I am trying to understand why "zeroing got cheaper" circa 2012-2014. Do you have any plausible explanations you can share?

Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) had already doubled the load throughput to the same figure. But the dataset being operated on at FB is most likely much larger than what L1+L2+L3 can hold, so I am wondering how much effect the vectorization engine could have had: bulk zeroing of large datasets is going to be bottlenecked by single-core memory bandwidth anyway, which at the time was ~20GB/s.
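To put rough numbers on that, here is a back-of-envelope sketch. The 32 bytes/cycle and ~20GB/s figures are the ones quoted above; the ~3 GHz core clock and the 1 GiB buffer size are illustrative assumptions, not measured values.

```python
# Back-of-envelope comparison of peak L1 store bandwidth vs. DRAM bandwidth.
# CLOCK_HZ and BUFFER_BYTES are assumed values chosen for illustration only.

STORE_BYTES_PER_CYCLE = 32   # Haswell store throughput quoted above
CLOCK_HZ = 3.0e9             # assumption: ~3 GHz core clock
DRAM_BW = 20e9               # ~20 GB/s single-core memory bandwidth quoted above
BUFFER_BYTES = 1 << 30       # assumption: 1 GiB buffer, far larger than any cache


def peak_store_bandwidth(bytes_per_cycle, clock_hz):
    """Peak bytes/second the core could store if everything stayed in L1."""
    return bytes_per_cycle * clock_hz


def zeroing_time(n_bytes, bandwidth):
    """Seconds needed to zero n_bytes at a sustained bandwidth (bytes/second)."""
    return n_bytes / bandwidth


l1_peak = peak_store_bandwidth(STORE_BYTES_PER_CYCLE, CLOCK_HZ)
ratio = l1_peak / DRAM_BW

print(f"peak L1 store bandwidth: {l1_peak / 1e9:.0f} GB/s")
print(f"ratio to DRAM bandwidth: {ratio:.1f}x")
print(f"1 GiB at DRAM speed:     {zeroing_time(BUFFER_BYTES, DRAM_BW) * 1e3:.1f} ms")
print(f"1 GiB at L1 peak speed:  {zeroing_time(BUFFER_BYTES, l1_peak) * 1e3:.1f} ms")
```

Under those assumptions the core could issue stores at roughly 96 GB/s, about 4.8x what DRAM can absorb from a single core, so for a buffer that does not fit in cache the wider stores would mostly go unused.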

Perhaps the operation became cheaper simply because of the move to another CPU uarch with a higher clock speed and larger memory bandwidth, rather than because of vectorization.