
Comment by josefcub

2 days ago

I've got 256GB of RAM on a Mac Studio M3 Ultra. Other posters are right: the M3 Ultra's prefill is very slow with really large models, taking 3-5 minutes to digest new additions to the context before generation continues. On my high-RAM machine I _can_ run 400b-500b models at Q2, and up to roughly 750b models at Q1, but the wait isn't the worst part.

Quantizing that aggressively degrades the output: the model becomes less capable overall and forgets things easily.

Here's what I'd do with 96GB of RAM: run Qwen 3.6 35b-a3b at Q8 for coding/agentic tasks. You'll get around 70 tokens/sec generation, the prefill is lightning fast in comparison, and you'll get a lot of work done. Qwen 3.6 27b is out now too; I'm getting 17 tok/sec generation with it, with a slower prefill.

The upshot is that you'll still have 20-40GB of RAM left for your workstation and development loads. Running Qwen 3.6 35b or 27b at Q8, the model with 128k context uses about 40GB of RAM; my OS and application load uses 20-30GB most of the time, for 60-70GB total. That leaves plenty of memory for you to work _and_ run inference.
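The RAM math above can be sketched as a quick back-of-envelope estimate. This is a rough rule of thumb, not a measurement: weight footprint is approximately parameter count times bits per weight, and the buffer overhead and OS load figures below are assumptions pulled from the numbers in this comment.

```python
def model_weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: billions of params * bits / 8."""
    return params_b * bits_per_weight / 8

# A ~35B model at Q8 (~8 bits/weight) is roughly 35 GB of weights.
weights = model_weight_gb(35, 8)

# KV cache and runtime buffers (assumed ~5 GB here) push the total
# toward the ~40 GB figure quoted above for 128k context.
model_total = weights + 5

# Subtract an assumed ~25 GB OS/application load from a 96 GB machine.
leftover = 96 - model_total - 25
print(f"model ~{model_total:.0f} GB, leftover ~{leftover:.0f} GB")
```

With these assumed numbers, you land around 31GB free, consistent with the 20-40GB range above.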

You _may_ be able to get Deepseek 4 Flash running, but only at a lower quantization like Q2 or Q3, which makes it noticeably dumber in comparison. And you may not have enough memory left over for any appreciable amount of context. Today's reasoning models need ample context to generate good answers, doubly so for agentic/coding tasks.
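To see why context eats so much memory, you can estimate the KV cache size. The standard formula is 2 (for K and V) times layers times context length times KV heads times head dimension times bytes per element; the model dimensions below are hypothetical, chosen only to illustrate the scale at 128k context.

```python
def kv_cache_gb(layers: int, context: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (fp16 elements by default).

    2x accounts for storing both the K and the V tensor per layer.
    """
    return 2 * layers * context * kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical mid-size model: 48 layers, 8 KV heads (GQA), head_dim 128,
# fp16 cache, full 128k (131072-token) context.
cache = kv_cache_gb(layers=48, context=131072, kv_heads=8, head_dim=128)
print(f"KV cache at 128k context: ~{cache:.1f} GB")
```

Even with grouped-query attention shrinking the KV head count, a full 128k context on these assumed dimensions costs tens of gigabytes, which is why a big low-quant model can leave you with no room for usable context.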

Thanks, that's very helpful.

Totally agree, context is everything for agentic coding.

Any other hardware recommendations that'll help run larger models?