Comment by adrian_b

2 days ago

It has been introduced in Intel Penryn, in November 2007.

However the AMD CPUs did not implement it until Bulldozer, in mid 2011.

While they lacked the many additional instructions provided by Bulldozer, also including AVX and FMA, for many applications the older Opteron CPUs were significantly faster than the Bulldozer-based CPUs, so there were few incentives for upgrading them, before the launch of AMD Epyc in mid 2017.

SSE 4.1 is a cut point in supporting old CPUs for many software packages, because older CPUs have a very high overhead for divergent computations (e.g. with if ... else ...) inside loops that are parallelized with SIMD instructions.