Comment by mswphd

6 hours ago

Dense models are more compute-heavy, so they generally run worse on a Mac. Macs tend to be better suited to (larger) MoE models.

A 27B dense model can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like an NVFP4 quant of Qwen3.6-27B

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4

should fit within ~22GB of VRAM, so it fits easily on a 5090. It would also fit on a 3090/4090, but iirc they don't support NVFP4 natively, so you would want a different quant for them.
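For a rough sanity check on that figure, here is a back-of-envelope sketch in Python. The assumptions are mine, not from the model card: NVFP4 is treated as ~4.5 bits per weight (4-bit values plus one 8-bit scale per 16-weight block), every weight is assumed to be quantized, and the `weight_gb` and `fits` helpers are hypothetical names made up for illustration. The gap between the weights-only number and ~22GB would go to things like KV cache, activations, and any layers left in higher precision, which this does not model.

```python
# Back-of-envelope VRAM estimate. All numbers are approximations.

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def fits(required_gb: float, vram_gb: float) -> bool:
    """True if the required memory is within the card's VRAM."""
    return required_gb <= vram_gb

N_PARAMS = 27e9  # 27B dense model

# Assumption: ~4.5 effective bits/weight for NVFP4 (4-bit value + shared 8-bit block scale).
nvfp4_gb = weight_gb(N_PARAMS, 4.5)
fp16_gb = weight_gb(N_PARAMS, 16)

# Card VRAM in GB; the ~22 GB total is the figure quoted in the comment.
cards = {"3090": 24, "4090": 24, "5090": 32}
TOTAL_GB = 22

print(f"NVFP4 weights only: {nvfp4_gb:.1f} GB")
print(f"FP16 weights only:  {fp16_gb:.1f} GB")
for name, vram in cards.items():
    print(f"{name}: fits ~{TOTAL_GB} GB total -> {fits(TOTAL_GB, vram)}")
```

Under those assumptions the unquantized FP16 weights alone would not fit on any of the three cards, which is why the quant matters here.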

You can see /r/LocalLLaMA for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s:

https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...

Note that you could possibly still do this with a Mac, as there are ways of hooking up an eGPU to Macs and using it for inference. My understanding is they're all fairly hacky though, so it would likely be preferable to just get a 3090 (or a non-NVIDIA option, e.g. an AMD R9700 Pro, which has ~32GB of VRAM for much cheaper than a 5090).

https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...

That seems considerably slower though (~30 tok/s). I don't know if that's an outlier, a misconfigured setup, or something else. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s or 5090s) have NVLink, so you can link the cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club-3090 makes that relatively straightforward

https://github.com/noonghunna/club-3090

but I don't have experience with it myself.
