Comment by colordrops

5 days ago

I assume they didn't fix the memory bandwidth pain point though.

The memory bandwidth limitation is baked into the GB10, and every vendor is going to be very similar there.

I'm really curious to see how things shift when the M5 Ultra, with "tensor" matmul functionality in the GPU cores, rolls out. That should be a multiple-fold speedup for that platform.

  • My guess is the M5 Ultra will be like a DGX Spark for prefill and an M3 Ultra for token generation, i.e. the best of both worlds, at FP4. Right now you can combine a Spark with an M3U: the former handles the compute-heavy prefill, lowering TTFT, while the latter does the token generation; with the M5U that split should no longer be necessary. However, given the RAM price situation, I wonder whether the M5U will ever get close to the price/performance of the Spark + M3U combo we have right now.

    • > you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part

      Are you doing this with vLLM, or some other model-running library/setup?


  • The M3 Ultra was released about 18 months after the original M3, so you could be waiting a while for the M5 Ultra.

    • The M3 Ultra was oddly delayed, though rumours are that the M5 Ultra should arrive much quicker. Most are estimating March-ish. We'll see. I think Apple has much stronger motivation to ship the higher-end M5 variants, given the enormous benefits the new matmul functionality offers.

At least for transformers, it can be partly mitigated with MoE + NVFP4: only a few experts are active per token, so the working set streamed from memory each step is small even though the resident model size is large.
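A rough back-of-envelope sketch of that point, assuming decode is purely memory-bandwidth bound (each generated token must stream the active weights once). All model figures below are hypothetical round numbers, not any specific model; the 273 GB/s figure is the GB10/DGX Spark class bandwidth:

```python
def decode_tokens_per_sec(active_params: float, bytes_per_param: float,
                          mem_bandwidth_gbs: float) -> float:
    """Upper bound on token generation rate when decode is limited only by
    streaming the active weights from memory once per token."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

BANDWIDTH = 273.0  # GB/s, roughly GB10 / DGX Spark class

# Hypothetical dense 120B model at FP16 (2 B/param):
# 240 GB read per token -> about 1.1 tok/s ceiling
dense = decode_tokens_per_sec(120e9, 2.0, BANDWIDTH)

# Hypothetical MoE with 120B resident but ~5B active params per token,
# weights in NVFP4 (~0.5 B/param): 2.5 GB per token -> ~109 tok/s ceiling
moe_fp4 = decode_tokens_per_sec(5e9, 0.5, BANDWIDTH)

print(f"dense FP16: {dense:.1f} tok/s, MoE NVFP4: {moe_fp4:.1f} tok/s")
```

Same hardware, same bandwidth; the ~100x gap comes entirely from shrinking the per-token working set (sparse activation) and the bytes per weight (FP4).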