
Comment by kbolino

5 days ago

I suspected this was because the vector units were not wide enough, and it seems that is the case. AVX2 is 256-bit, ARM NEON is only 128-bit.

The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over a decade old.

> The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support?

Very wide SIMD instructions require a lot of die space and a lot of power.

The AVX-512 implementation in Intel's Knights Landing took up 40% of the die area (source: https://chipsandcheese.com/p/knights-landing-atom-with-avx-5..., which is an excellent site for architectural analysis).

Most ARM desktop/mobile parts are designed to be low power and low cost. Spending valuable die space on large logic blocks for instructions that are rarely used isn't a good tradeoff for consumer apps.

Most ARM server parts are designed to have very high core counts, which requires small individual die sizes. Adding very wide SIMD support would grow die space of individual cores a lot and reduce the number that could go into a single package.

Supporting 256-bit or 512-bit instructions would be hard to do without interfering with the other design goals for those parts.

Even Intel has started dropping support for the wider AVX instructions in their smaller efficiency cores as a tradeoff to fit more of them into the same chip. For many workloads this is actually a good tradeoff. As this article mentions, many common use cases of high throughput SIMD code are just moving to GPUs anyway.

  • > The AVX-512 implementation in Intel's Knight's Landing took up 40% of the die area

    That chip family was pretty much designed to provide just enough CPU power to keep the vector engines fed. So that 40% is an upper bound: it's what you get when you try to build a GPU out of somewhat-specialized CPU cores (which was literally the goal of the first generation of that lineage).

    For a general purpose chip, there's no reason to spend that large a fraction of the area on the vector units. Something like the typical ARM server chips with lots of weak cores definitely doesn't need each core to have a vector unit capable of doing 512-bit operations in a single cycle, and probably would be better off sharing vector units between multiple cores. For chips with large, high-performance CPU cores (eg. x86), a 512-bit vector unit will still noticeably increase the size of a CPU core, but won't actually dwarf the rest of the core the way it did for Xeon Phi.

  • Knights Landing is a major outlier; the cores there were extremely small and had very few resources dedicated to them (e.g. 2-wide decode) relative to the vector units, so of course that will dominate. You aren't going to see 40% of the die dedicated to vector register files on anything looking like a modern, wide core. The entire vector unit (with SRAM) will be in the ballpark of like, cumulative L1/L2; a 512-bit register is only a single 64 byte cache line, after all.

    • Also, the Knights Landing/Knights Mill implementation is completely different from modern AVX-512. It's Ice Lake and Zen 4 that introduced modern AVX-512.

    • True! But even if only 20% of the die area goes to AVX-512 in larger cores, that makes a big difference for high core count CPUs.

      That would be like having a 50-core CPU instead of a 64-core CPU in the same space. For these cloud-native CPU designs, everything that takes significant die area translates to reduced core count.

      1 reply →

  • The rarity of use is a chicken-and-egg problem, though. The hardware makers consider it a waste because the software doesn't use it, and the software makers won't use it because it's not widely supported enough. Apple and Qualcomm not supporting it at all on any of their hardware tiers just exacerbates this. I think this is a good explanation for why mobile devices lack it, and even why, say, a MacBook Air or Mac Mini lacks it, but not why a MacBook Pro or Mac Studio lacks it.

    It does seem like server hardware is adopting SVE at least, even if it's not always paired with wider registers. There are lots of non-math-focused instructions in there that benefit many kinds of software that aren't transferable to a GPU.

    • > I think this is a good explanation for why mobile devices lack it, and even why say a MacBook Air or Mac Mini lacks it, but not why a MacBook Pro or Mac Studio lacks it.

      Apple has the problem that they "have to" have a professional "studio" lineup, but the prices are too high and the market volume too low to justify creating and validating what is essentially a fork of their SoC architecture.

  • KNL is an almost 15-year-old uarch expressly designed to compete with dedicated SIMD processors (GPGPU); dedicating the die to vector units is the point of that chip.

  • Yeah this seems likely, but with all the LLM stuff it might be an outdated assumption.

    Buy new chips next year! Haha :)

Wider SIMD is a solution in search of a problem in most cases.

If your code can go wide and has few branches (uses SIMD basically every cycle), either a GPU or matrix co-processor will handily beat the performance of several CPU cores all running together.

If your code can go wide but is branchy (uses bursts of SIMD between branches), wider becomes even less worth it. If a 256-bit SIMD instruction takes 4 cycles to put through and there are some branches before the next one, two 128-bit SIMD instructions will either execute in parallel in the same 4 cycles or, in the worst case, pipeline out to 5 cycles (just a single-instruction bubble in the FPU pipeline).

You can increase this differential by going to a 512-bit pipeline, but if 512-bit work is only occasional, you can still match it with 4 SIMD units (the latest couple of ARM cores have 6 SIMD units), and while pipelining out from 4 to 7 cycles means you need at least 3-cycle bubbles to break even, that doesn't seem too unusual.
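As a scalar sketch of the split described above (the function name is hypothetical; this is the logical shape of lowering one 256-bit operation onto two 128-bit halves):

```c
#include <stddef.h>

/* Hypothetical sketch: one 256-bit operation (a[i] += b[i] over 8 floats)
 * issued as two independent 128-bit halves. On a core with two or more
 * 128-bit SIMD pipes, the halves can execute in parallel, so the latency
 * is roughly the same as a single wide instruction. */
static void add8_as_two_halves(float a[8], const float b[8]) {
    for (size_t i = 0; i < 4; i++)   /* lower half: lanes 0..3 */
        a[i] += b[i];
    for (size_t i = 4; i < 8; i++)   /* upper half: lanes 4..7 */
        a[i] += b[i];
}
```

Since the two halves touch disjoint lanes, there is no dependency between them, which is exactly why the 128-bit units can run them side by side.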

The one area where this seems to be potentially untrue is simulations working with loads of f64 numbers, which can consistently achieve high density with code just branchy enough to make GPUs inefficient. Most of these workloads run on supercomputers, though, and the ARM competitor there is the Fujitsu A64FX, which does have 512-bit SVE.

It's also worth noting that even modern x86 chips (by both AMD and Intel) seem to throttle under heavy 512-bit multi-core workloads. Reducing the clock speed in turn reduces integer performance, which may make applications slower in some cases.

All of this is why ARM/Qualcomm/Apple's chips with 128-bit SIMD and a couple AMX/SME units are very competitive in most workloads even though they seem significantly worse on paper.

  • Video encoding and image compression are huge and common use cases, so much so that a lot of hardware has dedicated silicon for them. Of course, offloading to dedicated hardware accelerators does reduce usage of SIMD instructions, but any time a specific codec or algorithm isn't accelerated, the SIMD instructions are absolutely necessary.

    Emulators also use them a lot, often in unintended ways, because they are very flexible. This is partly because an emulator can use that flexibility to optimize emulation, but also because hand-optimizing with SIMD instructions can significantly improve the performance of any application, which was necessary on the low-performance processors common in videogame consoles.

SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side. RISC-V did the same thing with RVV, for better or worse.

  • > SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side.

    You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

    "Runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV or SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128 bits. They also have good predication support (like AVX-512), which allows masking off elements after a certain point.

    If you want to emulate AVX2 with SVE or RVV, you might require that the hardware has a native vector length >=256 bits, and then always mask off the bits beyond 256, so the same code works on any native vector length >=256.
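    A scalar model of that masking idea (the lane count and helper name are illustrative, not real SVE/RVV intrinsics):

```c
#include <stddef.h>

/* Conceptual model of SVE/RVV predication in scalar C: a predicate
 * enables only the first `active` lanes, so code written for a fixed
 * width (e.g. 8 x 32-bit lanes = 256 bits) runs unchanged on hardware
 * with any longer native vector. NATIVE_LANES is an assumed example. */
#define NATIVE_LANES 16          /* e.g. a 512-bit implementation */

static void predicated_add(int *dst, const int *src, size_t active) {
    for (size_t lane = 0; lane < NATIVE_LANES; lane++) {
        if (lane < active)       /* predicate: only first `active` lanes */
            dst[lane] += src[lane];
        /* inactive lanes are left untouched (merging predication) */
    }
}
```

    The hardware evaluates all lanes in parallel; the predicate simply suppresses writes to the lanes past the requested width.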

    • > You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

      Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time.

      ARM seems to be proposing a C language extension which does require compilers to support variably sized types, but it's not clear to me how the implementation of that is going, and equivalent support in other languages like Rust seems basically non-existent for now.

      4 replies →

  • Yeah, the extensions exist, and as pointed out by a sibling comment to yours, have been implemented in supercomputer cores made by Fujitsu. However, as far as I know, neither Apple nor Qualcomm have made any desktop cores with SVE support. So the biggest reason there's no desktop software for it is because there's no hardware support.

  • The only CPU I've encountered that supports SVE is the Cortex-X925/A725 used in the NVIDIA DGX Spark platform. The vector width is still only 128 bits, but you do get access to the other enhancements the SVE instructions bring, like predication (one of the most useful features from Intel's AVX-512).

  • RISC-V chip designers at least seem to be more bullish on vectors. There is seriously cool stuff coming like the SpacemiT K3 with 1024-bit vectors :)

ARM favored wider ILP and mostly symmetric ALUs, while x86 favored wider and asymmetric ALUs.

Most high-end ARM cores were 4x128b FMA, and Cortex-X925 goes to 6x128b FMA. Contrast that with Intel, which was 2x256b FMA for the longest time, then 2x512b FMA, with another 1-2 pipelines that can't do FMA.

But ultimately, 4x128b ≈ 2x256b, and 2x256b < 6x128b < 2x512b in throughput. Permute is a different factor though, if your algorithm cares about it.

[removed]

  • Part of the reason, I think, is that Qualcomm and Apple cut their teeth on mobile devices, and yeah wider SIMD is not at all a concern there. It's also possible they haven't even licensed SVE from Arm Holdings and don't really want to spend the money on it.

    In Apple's case, they have both the GPU and the NPU to fall back on, and a more closed/controlled ecosystem that breaks backwards compatibility every few years anyway. But Qualcomm is not so lucky; Windows is far more open and far more backwards compatible. I think the bet is that there are enough users who don't need/care about that, but I would question why they would even want Windows in the first place, when macOS, ChromeOS, or even GNU/Linux are available.

  • A ton of vector math applications these days involve high-dimensional vector spaces. A good example of that for arm would I guess be something like fingerprint or face id.

    Also, it doesn't just speed up vector math. Compilers with knowledge of these extensions can auto-vectorize your code, so they have the potential to speed up every for-loop you write.
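    For example, a plain loop like this saxpy kernel (no intrinsics anywhere) is a textbook auto-vectorization candidate; compilers can turn it into SIMD loads/multiply-adds/stores at -O2/-O3:

```c
/* A standard saxpy (y = a*x + y) loop. The `restrict` qualifiers tell
 * the compiler the arrays don't alias, which is what lets it safely
 * emit SIMD code for the loop body without any source changes. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

    Whether the vector registers are 128 or 512 bits wide only changes how many iterations each SIMD instruction covers; the source stays the same.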

    • > A good example of that for arm would I guess be something like fingerprint or face id.

      So operations that are not performance critical and are needed once or twice every hour? Are you sure you don't want to include a dedicated cluster of RTX 6090 Ti GPUs to speed them up?

      2 replies →

Haven't there been issues with AVX2 causing such a heavy load on the CPU that frequency scaling would kick in, in a lot of cases slowing down the whole CPU?

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...

My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.

  • Throttling was mainly an issue with AVX-512, which is twice the width of AVX2, and only really on the early Skylake (2015) implementation. From your own source, Ice Lake (2019) barely flinches and Rocket Lake (2021) doesn't proactively downclock at all. AMD's implementation came later but was solid right out of the gate.

  • This is a bit short-sighted. Yes, it is kinda tricky to get right, and a number of programming languages are quite behind on good SIMD support (though many are catching up).

    SIMD is not limited to mathy linear algebra things anymore. Did you know that lookup tables can be accelerated with AVX2? A lot of branchy code can be vectorized nowadays using scatter/gather/shuffle/blend/etc. instructions. The benefits vary, but can be significant. I think a view of SIMD as just being a faster/wider ALU is out of date.
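    For instance, a branchy clamp can be rewritten in the compare-and-blend style that SIMD compare/blend instructions implement (plain scalar C here, just to show the shape of it):

```c
#include <stdint.h>

/* Branchy code like `out = x > 0 ? x : 0` (a ReLU-style clamp) maps to
 * a SIMD compare that produces an all-ones/all-zeros mask, followed by
 * a blend. This scalar version mirrors that mask-based form. */
static int32_t relu_branchless(int32_t x) {
    int32_t mask = -(int32_t)(x > 0);  /* all-ones if x > 0, else zero */
    return x & mask;                   /* blend: keep x or select zero */
}
```

    Applied across a whole vector register, the same pattern handles every lane at once with no branches, which is why "branchy" scalar logic can still vectorize well.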

  • That’s only on very old CPUs. Getting benefits from vector extensions is incredibly easy if you do any kind of data crunching. A lot of integer operations not covered by BLAS can benefit including modern hash tables.

  • Re hard to get benefits: a lot depends on the compiler. In Elements (the toolchain this article was tested with) we made a bunch of modifications to LLVM passes to prioritise vectorisation in situations where LLVM could vectorise but didn't by default.

    I've heard anecdotally that the old pre-LLVM Intel C++ Compiler also focused heavily on vectorisation and had some specific tradeoffs to achieve it. I think they use LLVM now too, and for all I know they've made modifications similar to ours. But we see a decent number of code patterns that can be, and now are, optimised.

  • The modern approach is much more fine-grained throttling, so by the time it throttles you're already coming out ahead.