Personally, I doubt it. Apple hamstrung themselves with unified SoC memory: there are cheap dGPUs that smoke the M5's prefill speeds and even decode faster too. Apple is running up against the limits of pitting a mobile integrated chipset against hardware built for the desktop form factor. An SoC stops looking like a smart decision at that scale.
The software side is still pretty sketchy, too. Apple's ecosystem is fractured between the NPU, MPS and Accelerate BLAS, with libraries like MLX and CoreML built precariously on top. Apple would have to commit to a full rearchitecture of their GPU to challenge Nvidia, which would fracture that ecosystem even further.
The tradeoff is that the M3 Ultra's GPU loses to laptop GPUs in compute benchmarks. All of that memory bandwidth sits idle during token prefill, which is bound by compute rather than bandwidth.
For inference workloads, it makes a lot more sense to optimize for prefill/TTFT (time to first token) before maxing out memory bandwidth.
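To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes the usual approximations: prefill costs about 2 FLOPs per parameter per prompt token and is limited by compute, while decode reads every weight once per generated token and is limited by memory bandwidth. The two "devices" and all their specs are made up for illustration, not measurements of any real chip, and the model ignores attention cost, KV cache traffic and real-world utilization.

```python
def prefill_seconds(params, prompt_tokens, flops_per_s):
    """Estimated time to first token, assuming prefill is compute-bound
    at roughly 2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / flops_per_s


def decode_tokens_per_s(params, bytes_per_param, bandwidth_bytes_per_s):
    """Estimated decode rate, assuming each new token streams all
    weights from memory once (bandwidth-bound)."""
    return bandwidth_bytes_per_s / (params * bytes_per_param)


# Hypothetical workload: 8B-parameter model, 4-bit weights, 8k-token prompt.
PARAMS = 8e9
BYTES_PER_PARAM = 0.5
PROMPT_TOKENS = 8000

# Hypothetical devices (illustrative specs only).
devices = {
    "high-bandwidth, low-compute": {"flops": 30e12, "bw": 800e9},
    "low-bandwidth, high-compute": {"flops": 120e12, "bw": 400e9},
}

for name, spec in devices.items():
    ttft = prefill_seconds(PARAMS, PROMPT_TOKENS, spec["flops"])
    rate = decode_tokens_per_s(PARAMS, BYTES_PER_PARAM, spec["bw"])
    print(f"{name}: TTFT ~{ttft:.1f} s, decode ~{rate:.0f} tok/s")
```

Under these made-up specs, the high-bandwidth device decodes twice as fast but takes about four times longer to produce the first token on a long prompt, which is the tradeoff I'm talking about.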
With the M6 theoretically coming later this year, Apple seems to be realizing they need to catch up with more lanes of GPU.