Comment by nyrikki

4 hours ago

That config looked too complicated, getting rid of the --prio 3 and --poll 100, setting the draft-n-max to now recommended values, etc... kicked it up to 61 t/s

I think that was all about some earlier crashes.

     podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 128000 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1\
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --port 8080 \
    --host 0.0.0.0

0 comments

nyrikki

No comments yet

Contribute on Hacker News ↗