Comment by SwellJoe

2 hours ago

A 4-bit quantization of either Qwen 3.6 27b or Gemma 4 31b will run on a 32GB Mac with a decent-sized, but not full-sized, context. 64GB gets you the full ~256k context without needing to quantize your KV cache (though 8-bit KV quantization may still be worth it for speed). The 4-bit QAT version of Gemma 4 is practically identical in quality to the full-size and 8-bit versions in most benchmarks and in my tests, so there's no reason to run anything else. The 4-bit Qwen is a little lossy, since it hasn't gotten the QAT treatment, but not catastrophically so. A 6-bit dynamic quantization would be better for that model, but it's ~25GB on disk, and you'll need more than 32GB to run it with a big context.
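A back-of-the-envelope way to sanity-check those sizes: weight memory is roughly parameter count times bits per weight, divided by 8. This is only a sketch. The function name `weight_gb` is hypothetical, and the result is a lower bound for the weights alone: real quantized files also store scales and keep some tensors at higher precision (which is how a "6-bit" dynamic quant can land near 25GB), and the KV cache and runtime overhead come on top.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB (weights only, lower bound)."""
    # billions of params * bits per param / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, params in [("Qwen 3.6 27b", 27), ("Gemma 4 31b", 31)]:
        for bits in (4, 6, 8, 16):
            print(f"{name} @ {bits}-bit: ~{weight_gb(params, bits):.1f} GB")
```

By this estimate the 4-bit weights come to about 13.5 GB (27b) and 15.5 GB (31b), which is consistent with fitting on a 32GB machine with room left for a moderate context.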

I wrote up how I run local LLMs, with numbers, focusing on Qwen 3.6 and Gemma 4. I prefer Gemma 4 31b. The general consensus is that Qwen 3.6 is better for code, and it does score better on most coding-focused benchmarks, but it doesn't seem to be better for my use cases; Gemma feels smarter. And with QAT you get more smarts in less memory, so it's fast and runs on more hardware.

https://swelljoe.com/post/how-i-run-local-llms/

Currently, the sweet spot for self-hosted models is either Qwen 3.6 or Gemma 4, and those top out at 31B (Gemma) and 35B (Qwen, though you want the dense Qwen 3.6 27B if you can run it at a reasonable speed; the dense models are much smarter). So for now, a system with 64GB and a system with 128GB are going to be running the same models. Going to a bigger model doesn't get you better results, because there aren't any better models that are just a little bigger. I wish there were a ~70B or even ~120B MoE in the Qwen 3.6 or Gemma 4 families, as I've got a Strix Halo running a model that leaves a lot of memory on the table. It's not very fast, to boot; an MoE would be faster, and hopefully smarter if it were a much bigger model, like double or triple the size.
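The "an MoE would be faster" point follows from a common rule of thumb: token generation is mostly memory-bandwidth bound, so the speed ceiling is roughly bandwidth divided by the bytes of weights read per token, and an MoE only reads its active experts. A minimal sketch under loud assumptions: `max_decode_tps` is a hypothetical helper, the 256 GB/s bandwidth and the 3B-active MoE are illustrative placeholders rather than measured or published figures, and real throughput will be lower than this ceiling.

```python
def max_decode_tps(bandwidth_gb_s: float,
                   active_params_billion: float,
                   bits_per_weight: float) -> float:
    """Rough upper bound on decode tokens/sec for a bandwidth-bound model."""
    # GB of weights that must be read from memory for each generated token
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


if __name__ == "__main__":
    BANDWIDTH = 256.0  # GB/s, assumed placeholder; substitute your machine's figure
    dense = max_decode_tps(BANDWIDTH, 31, 4)  # dense 31B: every weight read per token
    moe = max_decode_tps(BANDWIDTH, 3, 4)     # hypothetical MoE with ~3B active params
    print(f"dense 31B @ 4-bit ceiling: ~{dense:.0f} tok/s")
    print(f"MoE, 3B active @ 4-bit ceiling: ~{moe:.0f} tok/s")
```

The total size of an MoE still has to fit in memory, which is why a bigger MoE would suit a machine with a lot of RAM and modest bandwidth.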

In short, right now 64GB is all you need for the best models you can self-host on anything short of a five-figure machine. But I wouldn't buy any hardware right now if you can wait a while. Tokens from DeepSeek are so cheap that you can wait out the memory shortage and get access to models you could never host locally. OpenRouter also always has free models, either in preview or just because, that you can use lightly, as they're rate-limited (but your self-hosted models are going to be rate-limited too, because a Mac Mini can't run models very fast). Google AI Studio has the Gemma 4 models for free as well, also rate/usage-limited.
