Comment by jychang
13 hours ago
32 GB of VRAM is more than enough for Qwen 3.5 35b.
You can just load the Q4_K_XL model as normal and put all tensors on the GPU, without any -ot or --cpu-moe flags.
If you need a massive context for some reason, where model + KV cache won't fit in 32 GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into system RAM. You'll take a speed hit (those params load from slower RAM instead of fast VRAM), but it'll work.
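A rough sketch of what that looks like with llama.cpp's llama-server. The model filename and context size here are just illustrative assumptions; `-ot` (`--override-tensor`) maps a tensor-name regex to a backend buffer:

```shell
# Hypothetical invocation: keep everything on GPU except the MoE expert
# FFN tensors of layers 0 and 1, which get overridden to CPU (system RAM).
llama-server \
  -m ./Qwen3.5-35B-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  -c 131072 \
  -ot 'blk\.(0|1)\.ffn_.*_exps\.=CPU'
```

Widen the layer range in the regex (e.g. `blk\.(0|1|2|3)\.`) if the KV cache still doesn't fit; each extra offloaded layer trades a bit more speed for VRAM headroom.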
Nice, ok, I’ll play with that. I’m mostly just learning what’s possible. Qwen 3.5 35b has been great without any customization, but it’s interesting to learn what the options are.