Comment by Aurornis
5 days ago
> The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support?
Very wide SIMD instructions require a lot of die space and a lot of power.
The AVX-512 implementation in Intel's Knights Landing took up 40% of the die area. (Source: https://chipsandcheese.com/p/knights-landing-atom-with-avx-5... , which is an excellent site for architectural analysis.)
Most ARM desktop/mobile parts are designed to be low power and low cost. Spending valuable die space on large logic blocks for instructions that are rarely used isn't a good tradeoff for consumer apps.
Most ARM server parts are designed to have very high core counts, which requires small individual core sizes. Adding very wide SIMD support would grow the die area of individual cores a lot and reduce the number that could go into a single package.
Supporting 256-bit or 512-bit instructions would be hard to do without interfering with the other design goals for those parts.
Even Intel has started dropping support for the wider AVX instructions in their smaller efficiency cores in order to fit more of them into the same chip. For many workloads this is actually a good tradeoff. As this article mentions, many common use cases for high-throughput SIMD code are moving to GPUs anyway.
> The AVX-512 implementation in Intel's Knights Landing took up 40% of the die area
That chip family was pretty much designed to provide just enough CPU power to keep the vector engines fed. So that 40% is an upper bound: it's what you get when you try to build a GPU out of somewhat-specialized CPU cores (which was literally the goal of the first generation of that lineage).
For a general purpose chip, there's no reason to spend that large a fraction of the area on the vector units. Something like the typical ARM server chip with lots of weak cores definitely doesn't need each core to have a vector unit capable of doing 512-bit operations in a single cycle, and probably would be better off sharing vector units between multiple cores. For chips with large, high-performance CPU cores (e.g. x86), a 512-bit vector unit will still noticeably increase the size of a CPU core, but won't actually dwarf the rest of the core the way it did for Xeon Phi.
Knights Landing is a major outlier; the cores there were extremely small and had very few resources dedicated to them (e.g. 2-wide decode) relative to the vector units, so of course that will dominate. You aren't going to see 40% of the die dedicated to vector register files on anything looking like a modern, wide core. The entire vector unit (with SRAM) will be in the ballpark of like, cumulative L1/L2; a 512-bit register is only a single 64 byte cache line, after all.
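To sanity-check that size comparison: a quick back-of-envelope sketch (the register count is the AVX-512 architectural state; the cache sizes are typical figures, not any specific core's):

```python
REG_BYTES = 512 // 8      # one 512-bit zmm register = 64 bytes = one cache line
ARCH_REGS = 32            # zmm0..zmm31, the AVX-512 architectural register state

reg_file = ARCH_REGS * REG_BYTES   # 2 KiB of architectural register state
l1d = 32 * 1024                    # a typical 32 KiB L1 data cache
l2 = 1024 * 1024                   # a typical 1 MiB L2

print(reg_file)                        # 2048 bytes
print(f"{reg_file / (l1d + l2):.1%}")  # a fraction of a percent of L1+L2 SRAM
```

Physical register files are several times larger than the 32 architectural entries, but the point stands: the SRAM involved is tiny next to the caches.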
Also, the Knights Landing/Knights Mill implementation of AVX-512 is completely different from the modern one. It's Ice Lake and Zen 4 that introduced modern AVX-512.
True! But even if only 20% of the die area goes to AVX-512 in larger cores, that makes a big difference for high core count CPUs.
That would be like having a 50-core CPU instead of a 64-core CPU in the same space. For these cloud-native CPU designs, everything that takes significant die area translates to reduced core count.
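As a back-of-envelope check of that claim (the 20% figure is the hypothetical above, and this toy model ignores uncore, L3, and interconnect area, which don't scale with core size):

```python
die_budget = 64.0     # die area that fits 64 baseline cores (arbitrary units)
avx512_share = 0.20   # hypothetical: AVX-512 is 20% of the enlarged core

core_plain = 1.0
core_avx = core_plain / (1 - avx512_share)   # each core grows to 1.25 units

print(int(die_budget // core_avx))  # 51 cores, close to the "50 vs 64" figure
```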
You're still grossly overestimating the area required for AVX-512. For example, on AMD's Zen 4, the entire FPU has been estimated at 25% of the core+L2 area, and that's including AVX-512. If you look at the extra area required for AVX-512 versus 256-bit AVX2, as a fraction of total die area (including L3 cache and the interconnect between cores), it's definitely not going to be a double-digit percentage.
The rarity of use is a chicken-egg problem, though. The hardware makers consider it a waste because the software doesn't use it, and the software makers won't use it because it's not widely supported enough. Apple and Qualcomm not supporting it at all on any of their hardware tiers just exacerbates it. I think this is a good explanation for why mobile devices lack it, and even why say a MacBook Air or Mac Mini lacks it, but not why a MacBook Pro or Mac Studio lacks it.
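The software half of that chicken-and-egg is runtime dispatch: a library ships several kernels and picks one per machine based on detected CPU features. A minimal sketch of the pattern, with the feature-flag sets stubbed in rather than detected (real code would query CPUID or the OS):

```python
def sum_scalar(xs):
    """Portable fallback: plain Python loop."""
    total = 0
    for x in xs:
        total += x
    return total

def sum_avx512(xs):
    # Stand-in for a vectorized kernel; real code would call into
    # native AVX-512 (e.g. via a compiled extension).
    return sum(xs)

def make_dispatcher(cpu_flags):
    """Pick the widest implementation the CPU supports."""
    if "avx512f" in cpu_flags:
        return sum_avx512
    return sum_scalar

# Stubbed flag sets standing in for an x86 server and an ARM laptop.
fast = make_dispatcher({"sse2", "avx2", "avx512f"})
slow = make_dispatcher({"neon"})
print(fast.__name__, fast(range(10)))  # sum_avx512 45
print(slow.__name__, slow(range(10)))  # sum_scalar 45
```

Every vendor that opts out of a feature adds another branch like this that library authors have to write, test, and benchmark, which is exactly why they hold off until support is widespread.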
It does seem like server hardware is adopting SVE at least, even if it's not always paired with wider registers. There are lots of non-math-focused instructions in there that benefit many kinds of software whose work isn't transferable to a GPU.
> I think this is a good explanation for why mobile devices lack it, and even why say a MacBook Air or Mac Mini lacks it, but not why a MacBook Pro or Mac Studio lacks it.
Apple has the problem that they "have to" have a professional "studio" lineup, but the prices are too high and the market volume too low to justify creating and validating what is essentially a fork of their SoC architecture.
KNL is an almost-15-year-old uarch expressly designed to compete with dedicated SIMD processors (GPGPUs); dedicating the die to vector units is the point of that chip.
Yeah, this seems likely, but with all the LLM stuff it might be an outdated assumption.
Buy new chips next year! Haha :)