Comment by IceWreck

7 hours ago

> So while feasible it's only great for batch jobs not interactive usage.

I mean yeah true but depends on how big the model is. The example I gave (Qwen 3.5 35BA3B) was fitting a 35B Q4 K_M (say 20 GB in size) model in 12 GB VRAM. With a 4070Ti + high speed 32 GB DDR5 ram you can easily get 700 token/sec prompt processing and 55-60 token/sec generation which is quite fast.

On the other hand if I try to fit a 120B model in 96 GB of DDR5 + the same 12 GB VRAM I get 2-5 token/sec generation.

1 comment

IceWreck

zozbot234 6 hours ago

Your 120B model likely has way more active parameters, so it can probably only fit a few shared layers in the VRAM for your dGPU. You might be better off running that model on a unified memory platform, slower VRAM but a lot more of it.