Comment by bitwize

3 days ago

"WHAT IS MY PURPOSE?"

"You multiply matrices of INT8s."

"OH... MY... GOD"

NPUs really just accelerate low-precision matmuls. A lot of them are based on systolic arrays, which are like a configurable pipeline that data is "pumped" through, rather than a general-purpose CPU or GPU with random memory access. So they're a bit like the "synergistic" processors in the Cell, in that they accelerate some operations really quickly, provided you feed them the right way from the CPU, and even then they don't have the oomph that a good GPU will get you.
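
In code, the primitive being accelerated is roughly the loop below (a minimal scalar sketch; the function name and row-major layout are my own choices, and a real NPU tiles this through its array rather than looping):

    #include <stddef.h>
    #include <stdint.h>

    /* The core NPU operation in scalar form: C = A * B with INT8 inputs
     * and an INT32 accumulator, since an int8*int8 product needs 16 bits
     * and summing K of them needs more. */
    void matmul_int8(const int8_t *A, const int8_t *B, int32_t *C,
                     size_t M, size_t N, size_t K)
    {
        for (size_t i = 0; i < M; i++)
            for (size_t j = 0; j < N; j++) {
                int32_t acc = 0;
                for (size_t k = 0; k < K; k++)
                    acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
                C[i * N + j] = acc;
            }
    }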

Do compilers know how to take advantage of that, or does a program need code written specifically for it?

  • It’s more like you program a dataflow rather than a program with instructions, as on conventional or VLIW-type processors. They still have operations, but I don’t think Ethos has any branch operations, for example.

  • There are specialized computation kernels compiled for NPUs. A high-level program (one that uses ONNX or CoreML, for example) can decide whether to run the computation using CPU code, a GPU kernel, or an NPU kernel, or to use multiple devices in parallel for different parts of the task, but the low-level code is compiled separately for each kind of hardware. So it's somewhat abstracted and automated by wrapper libraries, but still ultimately up to the program; the sketch below shows the shape of that dispatch.
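
    A hypothetical sketch of that dispatch pattern; this is not any real library's API, and every name in it is made up. The idea is just that each operator carries precompiled kernels per backend, and the runtime falls back to the CPU kernel when no NPU kernel exists:

        #include <stddef.h>

        typedef enum { BACKEND_CPU, BACKEND_NPU } backend_t;

        typedef struct {
            const char *name;                              /* e.g. "MatMul" */
            void (*cpu_kernel)(const void *in, void *out); /* always present */
            void (*npu_kernel)(const void *in, void *out); /* NULL if unsupported */
        } op_t;

        /* Run one operator on the preferred backend, falling back to
         * the CPU when this op was never compiled for the NPU. */
        static void run_op(const op_t *op, backend_t preferred,
                           const void *in, void *out)
        {
            if (preferred == BACKEND_NPU && op->npu_kernel != NULL)
                op->npu_kernel(in, out);
            else
                op->cpu_kernel(in, out);
        }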

My question is: isn't this exactly what SIMD has done before? Or SSE2 instructions?

To me, an NPU as it's described here just looks like a pretty shitty and useless FPGA that any FPGA from Xilinx could easily replace.

  • Yes, but your CPU has energy-inefficient things like caches and out-of-order execution that don't help with fixed workloads like matrix multiplication. AMD gives you 32 AI Engines in the die area of 3 regular Ryzen cores with full cache, and each AI Engine is more powerful than a Ryzen core at matrix multiplication.

    • I thought SSE2 and everything that came after it, like SSE4 or AVX-512, was made for streaming, using the cache only for direct access to speed things up?

      I haven't used SSE instructions for anything other than fiddling around yet, so I don't know if I'm wrong in that assumption. I understand the locking argument about cores, since at most two cores can access the same cache/memory at once... but wouldn't that have to be identical for FPUs if we compare this with SIMD + AVX?

  • You definitely would use SIMD if you were doing this sort of thing directly on the CPU. The NPU is just a large dedicated construct for linear algebra. You wouldn't really want to deploy FPGAs to user devices for this purpose, because that would mean paying the reconfigurability tax in both power draw and throughput.
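
    For a concrete idea of what the CPU-side SIMD version looks like, here is a minimal AVX2 sketch (function name mine; assumes n is a multiple of 16). _mm256_madd_epi16 multiplies 16 int16 pairs and sums adjacent products into 8 int32 lanes, the same multiply-accumulate an NPU does in bulk:

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Dot product of two int16 vectors, 16 elements per iteration. */
        int32_t dot_i16_avx2(const int16_t *a, const int16_t *b, size_t n)
        {
            __m256i acc = _mm256_setzero_si256();
            for (size_t i = 0; i < n; i += 16) {
                __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
                __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
                /* vpmaddwd: 16 int16 products -> 8 pairwise int32 sums */
                acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
            }
            /* Horizontal sum of the 8 int32 lanes. */
            __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                      _mm256_extracti128_si256(acc, 1));
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
            return _mm_cvtsi128_si32(s);
        }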

So it's a higher-powered, DSP-style device: small transformer models on streaming data. Sounds good for audio and maybe tailored video-stream processing.