Comment by sanchitmonga22

3 months ago

Ha, not yet. Metal 4 is interesting and we're keeping an eye on it.

MetalRT currently targets Metal 3.1 GPU compute because that's where we get the most control over the decode pipeline. Neural Engine / ANE is powerful for fixed-shape inference (vision, classification) but autoregressive LLM decode, where you're generating one token at a time with dynamic KV cache, doesn't map as cleanly to ANE today.

That said, if Metal 4 opens up new capabilities that help with sequential token generation or gives better programmable access to the neural accelerator, we'll absolutely look at it. The M5 will be a fun chip to benchmark on.

2 comments

sanchitmonga22

woadwarrior01 3 months ago

> Neural Engine / ANE is powerful for fixed-shape inference (vision, classification) but autoregressive LLM decode, where you're generating one token at a time with dynamic KV cache, doesn't map as cleanly to ANE today.

What does the ANE have to with this?

Neural Engine (ANE) and the M5 Neural Accelerator (NAX) are not the same thing. NAX can accelerate LLM prefill quite dramatically, although autoregressive decoding remains memory bandwidth bound.

I suspect the biggest blocker for Metal 4 adoption is the macOS Tahoe 26 requirement.

sanchitmonga22 3 months ago

Good correction, thanks. You're right that NAX and ANE are distinct, I shouldn't have conflated them. NAX's ability to accelerate LLM prefill is exactly the kind of capability that could complement MetalRT's decode-focused pipeline. Appreciate the clarification on the Metal 4 / Tahoe requirement too.