Comment by greyskull
9 hours ago
Thanks, I'll have to continue experimenting. I just ran Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL and it works, but if Gemini is to be believed, it's using so much VRAM that there's little left for chat context.
How did you land on that model? Hard to tell if I should be a) going to 3.5, b) going to fewer parameters, c) going to a different quantization/variant.
I didn't consider those other flags either, cool.
Are you having good luck with any particular harnesses or other tooling?
35B-A3B means it's a MoE (mixture-of-experts) model with 35B total parameters but only 3B active per token. The one I use is the 27B dense model. Dense models usually give better responses, but they're slower than comparable MoE models. With your 4090, you should be able to get about 50 tok/s with the dense model, which is more than enough for practical use.
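To make the trade-off concrete, here's a rough back-of-the-envelope for weight memory. The ~4.5 bits/weight figure for a Q4_K_XL-style quant is my assumption, and real usage adds KV cache, compute buffers, and runtime overhead on top:

```python
# Rough VRAM estimate for quantized model weights (illustrative only).
# Assumption: ~4.5 bits/weight for a Q4_K_XL-style quant; actual GGUF
# files mix quant types per tensor, so treat this as a ballpark.
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (decimal) for a quantized model."""
    return total_params * bits_per_weight / 8 / 1e9

moe_35b = weights_gb(35e9, 4.5)    # 35B total params (MoE)
dense_27b = weights_gb(27e9, 4.5)  # 27B dense

print(f"35B MoE  @ ~4.5 bpw: {moe_35b:.1f} GB")   # ~19.7 GB of weights
print(f"27B dense @ ~4.5 bpw: {dense_27b:.1f} GB") # ~15.2 GB of weights
```

Note that the MoE's 3B active parameters help with speed, not memory: all 35B worth of weights still have to fit somewhere, which is why it can squeeze a 24 GB card once you add context.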
If you want to keep using the same model, these settings worked for me:
llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
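Once the server is up, you can sanity-check it from any OpenAI-compatible client. Here's a minimal sketch using only the Python standard library; it assumes llama-server's default port 8080 and its /v1/chat/completions endpoint (the host/port are whatever you passed on the command line):

```python
# Minimal client sketch for llama-server's OpenAI-compatible API.
# Assumes the server is reachable at localhost:8080 (llama-server's default).
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Build a chat-completions request body for a single user message."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send one prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Handy for scripting against the local model without pulling in any extra dependencies.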
For the harness, I use pi (https://pi.dev/). And sometimes, I use the Roo Code plugin for VS Code. (https://roocode.com/)
I prefer simple tooling, since it's easier to understand. But you might have better luck with other harnesses.