Comment by nyrikki

5 hours ago

You can get all the Qwen 3.x models up to ~1 million tokens using YaRN with llama.cpp.[0]

Personally I am using `--no-context-shift` and feeding in context back in on failure at the harness level.

I have 2x1080ti + 1xTitanV that have a full 262,144 tokens context on 262,144 tokens with `-sm tensor` at 62.04 t/s which isn't so bad.

But I also have a 1x3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s but with only 130k context, but if you have a modular programming style both work pretty well.

But play with YaRN if you really need it.

[0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...

How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.

HEre's my setup:

  llama-server
  --port 9999
  --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf
  --ctx-size 128000
  --threads 12
  --flash-attn on
  --device CUDA0
  --jinja
  --gpu-layers 52
  --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf
  --cache-type-k q8_0
  --cache-type-v q8_0
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
  --spec-type draft-mtp --spec-draft-n-max 2

(I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)

  • (Note UPDATED config)

    Ya, if you are using the CPU it may slowdown quick.

    This may be a bit huge and overcomplicated, on this host I am running it on a AMD Ryzen 7 5700G so that I can use the APU to dedicate the 3090.

        podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
        -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
        -ngl 99 \
        --ctx-size 131072 \
        --no-mmproj-offload \
        --no-context-shift \
        --kv-unified \
        --spec-type draft-mtp \
        --spec-draft-n-max 6 \
        --spec-draft-p-min 0.75 \
        -fa on --jinja --no-mmap \
        --cache-ram -1 \
        --no-warmup -np 1 \
        -n 32768 \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --temp 0.6 \
        --min-p 0.00 \
        --top-k 20 \
        --top-p 0.95 \
        --presence-penalty 0.0 \
        --repeat-penalty 1.05 \
        --fit off \
        --reasoning on \
        --chat-template-kwargs '{"preserve_thinking":true}' \
        --prio 3 \
        --poll 100 \
        --port 8080 \
        --host 0.0.0.0
    
    

    I am just building the container with:

         podman build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
    

    And here is the logs from a 'make me a flappy bird program in python' webui prompt.

         prompt eval time =     105.86 ms /    19 tokens (    5.57 ms per token,   179.47 tokens per second)
           eval time =  100549.41 ms /  4608 tokens (   21.82 ms per token,    45.83 tokens per second)
          total time =  100655.28 ms /  4627 tokens
         draft acceptance rate = 0.47215 ( 3408 accepted /  7218 generated)
    

    I am down to ~25.54 t/s with a 95% full context.

    • That config looked too complicated, getting rid of the --prio 3 and --poll 100, setting the draft-n-max to now recommended values, etc... kicked it up to 61 t/s

      I think that was all about some earlier crashes.

           podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
          -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
          -ngl 99 \
          --ctx-size 128000 \
          --no-mmproj-offload \
          --no-context-shift \
          --kv-unified \
          --spec-type draft-mtp \
          --spec-draft-n-max 2 \
          --spec-draft-p-min 0.75 \
          -fa on --jinja --no-mmap \
          --cache-ram -1 \
          --no-warmup -np 1\
          -n 32768 \
          --cache-type-k q8_0 \
          --cache-type-v q8_0 \
          --temp 0.6 \
          --min-p 0.00 \
          --top-k 20 \
          --top-p 0.95 \
          --presence-penalty 0.0 \
          --repeat-penalty 1.05 \
          --fit off \
          --reasoning on \
          --chat-template-kwargs '{"preserve_thinking":true}' \
          --port 8080 \
          --host 0.0.0.0