Comment by speedgoose
6 days ago
I have a Mac Studio M3 Ultra on my desk, and a user account on an HPC full of NVIDIA GH200s. I use both, and the Mac has its purpose.
It can notably run some of the best open weight models with little power and without triggering its fan.
It can run them, and the token generation is fast enough, but the prompt processing is so slow that it makes them next to useless. That is the case with my M3 Pro at least, compared to the RTX I have in my Windows machine.
This is why I'm personally waiting for the M5/M6 to finally have some decent prompt processing performance; it makes a huge difference in all the agentic tools.
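The prefill-vs-decode point can be put in back-of-envelope numbers. A minimal sketch below, where all throughput figures are illustrative assumptions (not benchmarks of any specific machine), shows why slow prompt processing dominates latency for agentic workloads with large contexts and short answers:

```python
# Back-of-envelope model of single-request LLM latency.
# All tokens/sec numbers are assumptions for illustration only.

def time_to_first_token(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tps

def total_latency(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Prefill time plus decode time for one request."""
    return (time_to_first_token(prompt_tokens, prefill_tps)
            + output_tokens / decode_tps)

# A typical agentic-tool request: large context, short answer.
prompt, output = 30_000, 500

# Hypothetical throughputs: strong decode but weak prefill (Mac-class)
# vs. much faster prefill on a discrete GPU (RTX-class).
mac = total_latency(prompt, output, prefill_tps=300, decode_tps=40)
rtx = total_latency(prompt, output, prefill_tps=5_000, decode_tps=60)

print(f"Mac-class: {mac:.1f}s total, {prompt/300:.1f}s of it prefill")
print(f"RTX-class: {rtx:.1f}s total, {prompt/5_000:.1f}s of it prefill")
```

Under these assumed numbers, nearly 90% of the Mac-class latency is prefill, which is why decode speed alone doesn't make long-context agentic use pleasant.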
Just add a DGX Spark for prefill and stream it to the M3 using Exo. The M5 Ultra should have about the same FP4 compute as a DGX Spark, and you don't have to wait until Apple releases it. Also, a 128GB "appliance" like that is now "super cheap" given the RAM prices, and that won't last long.
>with little power and without triggering its fan.
This is how I know something is fishy.
No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.
I understand that if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on zero people's Christmas list.
It was for my team. Running useful LLMs on battery power is neat, for example. Some people simply care a bit about sustainability.
It’s also good value if you want a lot of memory.
What would you advise for people with a similar budget? It's a real question.
But you aren't really running LLMs. You just say you are.
There is novelty, but no practical use case.
My $700, 2023, 3060 laptop runs 8B models. At the enterprise level we got two A6000s.
Both are useful and were used for economic gain. I don't think you have gotten any gain.