Comment by imtringued

2 days ago

Yes, but your CPUs have energy-inefficient features like caches and out-of-order execution that do not help with fixed workloads like matrix multiplication. AMD gives you 32 AI Engines in the space of 3 regular Ryzen cores with full cache, and each AI Engine is more powerful than a Ryzen core for matrix multiplication.

I thought SSE2 and everything that came after it, like SSE4 or AVX-512, was made for streaming, using the cache only for direct access to speed things up?

I haven't used SSE instructions for anything beyond fiddling around with them, so I may well be wrong in this assumption. I understand the lock-state argument about cores, since at most 2 cores can access the same cache/memory at a time... but wouldn't the same have to hold for FPUs if we compare them with SIMD/AVX?
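
To make the "streaming" part of my question concrete, here is a rough sketch in C of what I mean. It is not taken from any real codebase: `add_stream` is a made-up helper, and it assumes the length is a multiple of 4 and that all three buffers are 16-byte aligned. The loads are ordinary aligned loads that go through the cache as usual, while `_mm_stream_ps` is a non-temporal store, i.e. a hint that the result should not be kept in the cache. When the compiler does not define `__SSE2__`, it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h> /* SSE2 header; also pulls in the SSE float intrinsics */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
 * Assumptions (mine, for illustration): n is a multiple of 4 and
 * dst, a, b are all 16-byte aligned. */
void add_stream(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);

#ifdef __SSE2__
    for (size_t i = 0; i < n; i += 4) {
        /* Regular aligned loads: these are served through the cache. */
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        /* Non-temporal store: hints that dst should bypass the cache. */
        _mm_stream_ps(dst + i, _mm_add_ps(va, vb));
    }
    /* Order the streamed stores before any later stores. */
    _mm_sfence();
#else
    /* Scalar fallback for targets without SSE2. */
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

Calling it with three aligned arrays of 8 floats, e.g. `add_stream(d, a, b, 8)`, fills `d` with the element-wise sums. So even in the "streaming" case the inputs still come in through the cache hierarchy; only the output side is told to skip it, which is part of why I am unsure how far the "caches do not help" argument goes.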