Comment by woadwarrior01

3 months ago

> Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.

So, no support for M5 Neural Accelerators, eh? (Requires Metal 4) ¯\_(ツ)_/¯

Ha, not yet. Metal 4 is interesting and we're keeping an eye on it.

MetalRT currently targets Metal 3.1 GPU compute because that's where we get the most control over the decode pipeline. Neural Engine / ANE is powerful for fixed-shape inference (vision, classification) but autoregressive LLM decode, where you're generating one token at a time with dynamic KV cache, doesn't map as cleanly to ANE today.

That said, if Metal 4 opens up new capabilities that help with sequential token generation or gives better programmable access to the neural accelerator, we'll absolutely look at it. The M5 will be a fun chip to benchmark on.
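To make the "one token at a time with dynamic KV cache" point concrete, here is a minimal pure-Python sketch of a single-head decode loop. It is a toy under stated assumptions, not MetalRT's pipeline: the names `attend` and `decode` are hypothetical, and the learned Q/K/V projections are replaced by the identity. The point is only that the cache, and so the attention input shape, grows by one entry per generated token.

```python
import math
import random


def attend(q, keys, values):
    """Single-head attention of one query vector over all cached keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    # Numerically stable softmax over however many entries the cache holds.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def decode(steps, dim=4, seed=0):
    """Toy decode loop: one token per step, KV cache grows by one each step."""
    rng = random.Random(seed)
    k_cache, v_cache, cache_lengths = [], [], []
    x = [rng.uniform(-1, 1) for _ in range(dim)]
    for _ in range(steps):
        # Assumption: identity stands in for the learned K and V projections.
        k_cache.append(x)
        v_cache.append(x)
        cache_lengths.append(len(k_cache))
        # The next "token" depends on the full cache, so steps cannot be
        # batched together and the attention shape changes every iteration.
        x = attend(x, k_cache, v_cache)
    return cache_lengths, x
```

Calling `decode(5)` returns the cache length seen at each step alongside the final vector; the lengths increase by one per step, which is the shape variability a fixed-shape accelerator has to work around.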

  • > Neural Engine / ANE is powerful for fixed-shape inference (vision, classification) but autoregressive LLM decode, where you're generating one token at a time with dynamic KV cache, doesn't map as cleanly to ANE today.

    What does the ANE have to do with this?

    Neural Engine (ANE) and the M5 Neural Accelerator (NAX) are not the same thing. NAX can accelerate LLM prefill quite dramatically, although autoregressive decoding remains memory bandwidth bound.

    I suspect the biggest blocker for Metal 4 adoption is the macOS Tahoe 26 requirement.

    • Good correction, thanks. You're right that NAX and ANE are distinct; I shouldn't have conflated them. NAX's ability to accelerate LLM prefill is exactly the kind of capability that could complement MetalRT's decode-focused pipeline. Appreciate the clarification on the Metal 4 / Tahoe requirement too.
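
The prefill versus decode distinction in this exchange can be put into rough numbers with a roofline-style estimate. The sketch below is a back-of-the-envelope model, not a measurement: it counts only weight traffic for one matrix multiply (ignoring activations and the KV cache), assumes 2-byte weights, and the function names and the hardware figures passed in are placeholders rather than specs for any real chip.

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_weight=2):
    """FLOPs per byte of weights read for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out          # one multiply + one add per weight per token
    weight_bytes = bytes_per_weight * d_in * d_out
    return flops / weight_bytes


def limiting_resource(tokens, d_in, d_out, peak_flops, bandwidth_bytes_per_s):
    """Return 'compute' or 'memory' depending on which side of the roofline we land."""
    # Machine balance: FLOPs the hardware can do per byte it can fetch.
    machine_balance = peak_flops / bandwidth_bytes_per_s
    intensity = arithmetic_intensity(tokens, d_in, d_out)
    return "compute" if intensity >= machine_balance else "memory"


# Placeholder hardware numbers (assumptions, not any real chip's specs).
PEAK_FLOPS = 10e12       # 10 TFLOP/s
BANDWIDTH = 100e9        # 100 GB/s

decode_limit = limiting_resource(1, 4096, 4096, PEAK_FLOPS, BANDWIDTH)
prefill_limit = limiting_resource(512, 4096, 4096, PEAK_FLOPS, BANDWIDTH)
```

With 2-byte weights the intensity works out to roughly one FLOP per byte for each token processed together, so single-token decode sits far below the machine balance (memory bound) while a 512-token prefill sits above it (compute bound). Under this model, extra matrix throughput helps prefill, and decode speed tracks memory bandwidth.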