Comment by chorizo

1 day ago

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

14 comments

chorizo

I beg to differ. Have a look at this repo with single/double 3090 optimized configs for Qwen and Gemma models: https://github.com/noonghunna/club-3090

sanderjd 1 day ago

Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?

SkitterKherpi 1 day ago

You can run 8bit 27B models at 24GB, it's definitely enough for the model size.

SwellJoe 1 day ago
The 8-bit quantized 27B Qwen 3.6 is 29GB. You absolutely cannot run that entirely on a 24GB GPU.
You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.
32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).
You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.
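
The sizes quoted above can be sanity-checked with back-of-the-envelope arithmetic. This is a minimal sketch, not a measurement: the function names are made up for illustration, the 8.5 bits per weight figure is llama.cpp's Q8_0 layout (32 int8 values plus one fp16 scale per block), the ~4.8 bits per weight for a 4-bit K-quant is an approximation, and the layer/head numbers in the KV example are placeholders rather than the real config of any Qwen model. Models with non-standard attention layers can need much less KV memory than this formula suggests.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size in GB for plain (full) attention.

    The factor of 2 accounts for storing both keys and values.
    """
    elems = 2 * layers * kv_heads * head_dim * context_tokens
    return elems * bytes_per_elem / 1e9


if __name__ == "__main__":
    # 27B parameters at ~8.5 bits/weight (Q8_0-style) and ~4.8 bits/weight
    # (assumed figure for a 4-bit K-quant).
    print(f"8-bit weights: {weight_gb(27, 8.5):.1f} GB")
    print(f"4-bit weights: {weight_gb(27, 4.8):.1f} GB")

    # HYPOTHETICAL architecture: 32 layers, 8 KV heads, head_dim 128,
    # 32k context. Not the real config of any specific model.
    print(f"KV fp16: {kv_cache_gb(32, 8, 128, 32768, 2):.2f} GB")
    print(f"KV 8-bit: {kv_cache_gb(32, 8, 128, 32768, 1):.2f} GB")
```

Under these assumptions the weight estimates land close to the 29GB and 16-17GB figures in the comment, and halving the bytes per KV element halves the cache, which is why KV quantization matters so much on a 24GB or 32GB card.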
- cpburns2009 1 day ago
  
  A 32gb card does run it nicely. I use unsloth's UD-Q5_K_XL at 256k context (k/v at q8_0), and get ~67 t/s on a 5090. I still need to look into MTP.
  
bityard 1 day ago
Quantization is a trade-off, though. The quality, while still perhaps good enough for many tasks, is not as good as the full 16-bit weights that the model was designed for/released with.
barbacoa 1 day ago

I'm running qwen 3.6 27b at 8bit quantization and 262k context. It takes 53gb of vram on my system.
jnovek 1 day ago
I think that’s only true for MoE models. A dense model like 3.6 27b will require more (plus a KV store).
- bityard 1 day ago
  
  No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of the experts is used to predict the next token, but the set of experts used changes with every token.
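
  A rough way to see the distinction: the memory that must stay resident scales with the total parameter count, while the weights actually read per token scale with the active parameter count. The sketch below is illustrative only; `moe_footprint` is a made-up helper, and the 30B total / 3B active split is a hypothetical example, not the spec of a particular model.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (resident_gb, read_per_token_gb) for a quantized MoE model.

    resident_gb: all weights must be loaded, since any part of the model
                 can be selected for the next token.
    read_per_token_gb: only the active weights are touched per token,
                       which is what makes inference faster.
    """
    resident_gb = total_params_b * bits_per_weight / 8
    read_per_token_gb = active_params_b * bits_per_weight / 8
    return resident_gb, read_per_token_gb


if __name__ == "__main__":
    # HYPOTHETICAL model: 30B total parameters, 3B active, 4-bit weights.
    resident, per_token = moe_footprint(30, 3, 4)
    print(f"resident: {resident} GB, read per token: {per_token} GB")
```

  So the hypothetical model still needs about 15 GB for weights even though each token only touches about 1.5 GB of them.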

angoragoats 20 hours ago

So buy two.