Comment by Vaskivo
5 hours ago
How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.
HEre's my setup:
llama-server
--port 9999
--model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf
--ctx-size 128000
--threads 12
--flash-attn on
--device CUDA0
--jinja
--gpu-layers 52
--mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf
--cache-type-k q8_0
--cache-type-v q8_0
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
--spec-type draft-mtp --spec-draft-n-max 2
(I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)
(Note UPDATED config)
Ya, if you are using the CPU it may slowdown quick.
This may be a bit huge and overcomplicated, on this host I am running it on a AMD Ryzen 7 5700G so that I can use the APU to dedicate the 3090.
I am just building the container with:
And here is the logs from a 'make me a flappy bird program in python' webui prompt.
I am down to ~25.54 t/s with a 95% full context.
That config looked too complicated, getting rid of the --prio 3 and --poll 100, setting the draft-n-max to now recommended values, etc... kicked it up to 61 t/s
I think that was all about some earlier crashes.