Comment by nyrikki
4 hours ago
That config looked too complicated, getting rid of the --prio 3 and --poll 100, setting the draft-n-max to now recommended values, etc... kicked it up to 61 t/s
I think that was all about some earlier crashes.
podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 \
--ctx-size 128000 \
--no-mmproj-offload \
--no-context-shift \
--kv-unified \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.75 \
-fa on --jinja --no-mmap \
--cache-ram -1 \
--no-warmup -np 1\
-n 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--min-p 0.00 \
--top-k 20 \
--top-p 0.95 \
--presence-penalty 0.0 \
--repeat-penalty 1.05 \
--fit off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--port 8080 \
--host 0.0.0.0
No comments yet
Contribute on Hacker News ↗