Comment by CalChris

9 years ago

Just as information, the Loop Stream Detector was introduced in Intel Core microarchitectures. With Sandy Bridge, it was 28 μops. It grows to 56 μops (with HT off) with Haswell and Broadwell. It grows again (considerably) with Skylake to 64 μops per thread (HT on or off).

The LSD resides in the BPU (branch prediction unit) and it is basically telling the BPU to stop predicting branches and just stream from the LSD. This saves energy. However, predicting is different than resolving. Branch resolution still happens and when resolution (speculation) fails, the LSD bails out.

In any case, 64 μops is a lot. That's a good sized inner loop.