Comment by JSR_FDED

8 hours ago

I assumed the 27B dense model would be preferable to a MoE model, and that it wouldn’t fit into a consumer graphics card, which leaves the Macs.

Then I assumed for cost and battery/heat reasons that a Mini would be better than a laptop.

7 comments


SwellJoe 1 hour ago

The current dense models from the Gemma 4 or Qwen 3.6 families will run well on a consumer GPU with 32GB in a 4-bit quantization (which is a little lossy for Qwen 3.6, not so much for Gemma 4, as it has a QAT 4-bit version). Even an Intel Arc B70 will work, though it's worth spending a little more for the AMD Radeon AI Pro 9700, as it'll be like 40% faster, I think. A dedicated GPU will be faster and cheaper than a Mac Mini. But nothing is a good deal right now; everything is overpriced (except DeepSeek tokens, which cost pennies to run a model that's better than anything you could self-host... DeepSeek V4 Flash, and even Pro, are absurdly cheap, made even cheaper by their bonkers cheap cached token pricing and uniquely effective caching).

blensor 8 hours ago

The reason I was curious is that I am running my stuff on a Strix Halo, and I get the feeling that this class of devices (GMKtec, Minisforum, Lenovo, etc.) is becoming a pretty good alternative.

c7b 5 hours ago
Unified memory feels like the future of consumer hardware, agreed! Do check out r/StrixHalo
- blensor 2 hours ago
  
  Agreed. It was a bit of a pain to get running on my Ubuntu machine because I had old amdgpu-dkms-firmware packages installed without realizing it. But now that it's running, it's amazing how well it works.
  
adastra22 5 hours ago

Strix Halo offers better performance than a Mac Mini, but not as good as a Mac Studio. But the 128GB of unified memory is awesome for larger models.

mswphd 3 hours ago

Dense models are more compute-heavy, so they generally run worse on a Mac. Macs tend to be better for (larger) MoE models.
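To make the dense vs. MoE point a bit more concrete, here is a rough sketch using the common rule of thumb of about 2 FLOPs per active parameter per generated token. The 10B-active MoE is a made-up example rather than a specific model, and the 4-bit weight size is an assumption:

```python
def per_token_cost(active_params_b: float, bits_per_weight: float = 4.0):
    """Return (GFLOPs, GB of weights read) per generated token.

    Rule of thumb only: ~2 FLOPs per active parameter per token,
    and every active weight is read once per token.
    active_params_b is in billions of parameters.
    """
    gflops = 2.0 * active_params_b
    gb_read = active_params_b * bits_per_weight / 8.0
    return gflops, gb_read


# 27B dense: every weight is active for every token.
dense = per_token_cost(27.0)

# Hypothetical MoE with 10B active parameters per token (illustrative only).
moe = per_token_cost(10.0)

print("dense 27B:", dense)
print("MoE, 10B active:", moe)
```

The MoE still needs all of its experts resident in memory, which is where large unified memory helps, but each token only pays compute and bandwidth for the active subset.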

A 27B dense model can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like an NVFP4 quant of Qwen3.6 27B

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4

should fit within ~22GB of VRAM, so it fits easily on a 5090. It would also fit on a 3090/4090 (24GB each), but iirc they don't support NVFP4 natively, so you would want a different quant for them.
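For a back-of-the-envelope check on whether a quantized model fits, you can add up weights and KV cache. This is only a sketch: the layer count, KV head count, head dimension, and context length below are placeholder values, not the real Qwen3.6 27B architecture, and the 4.5 bits per weight assumes (as I understand NVFP4) 4-bit values plus one 8-bit scale per 16-weight block. It ignores activations and runtime overhead.

```python
def estimate_vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes=2):
    """Return (weights_gb, kv_cache_gb) using 1 GB = 1e9 bytes.

    params_b is in billions of parameters; kv_bytes is bytes per cached
    element (2 for FP16/BF16).
    """
    weights_gb = params_b * bits_per_weight / 8.0
    # One K and one V vector per layer, per KV head, per token.
    kv_bytes_total = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return weights_gb, kv_bytes_total / 1e9


# All architecture numbers here are hypothetical placeholders.
weights_gb, kv_gb = estimate_vram_gb(
    params_b=27.0,
    bits_per_weight=4.5,   # assumed effective size of a 4-bit block format
    n_layers=64,
    n_kv_heads=8,
    head_dim=128,
    context_len=16384,
)
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
      f"total ~{weights_gb + kv_gb:.1f} GB")
```

Plug in the real config values from the model card to get a number worth comparing against your card's VRAM.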

You can see r/LocalLLaMA for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s

https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...

Note that you could possibly still do this with a Mac, as there are ways of hooking up an eGPU to Macs and using it for inference. My understanding is that they're all fairly hacky though, so it would likely be preferable to just get a 3090 (or a non-NVIDIA option, e.g. an AMD R9700 Pro, which has ~32GB of VRAM for much less than a 5090).

https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...

That setup seems considerably slower though (~30 tok/s). I don't know if that's an outlier, a misconfigured setup, or what. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s or 5090s) have NVLink, so you can link two cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club-3090 makes that relatively straightforward

https://github.com/noonghunna/club-3090

but I don't have experience with it myself.