Comment by zargon

6 hours ago

I just loaded up Qwen3.6 27B at Q8_0 quantization in llama.cpp, with a 131072-token context and a Q8_0 KV cache:

  build/bin/llama-server \
    -m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
    --no-mmap \
    --n-gpu-layers all \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --jinja \
    --no-mmproj \
    --parallel 1 \
    --cache-ram 4096 -ctxcp 2 \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking": true}'

Should fit nicely in a single 5090:

  self    model   context   compute   (MiB)
  30968 = 25972 +    4501 +     495

Even bumping up to a 16-bit K cache should fit comfortably if you drop down to 64K context, which is still a pretty decent amount. I'd try both. I'm not sure how tolerant the Qwen3.6 series is of dropping the K cache to 8 bits.
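Rough arithmetic on why halving the context buys room for the 16-bit K cache: f16 is 2 bytes per element, while q8_0 is about 1.0625 (blocks of 32 values plus an fp16 scale, 34 bytes per 32 elements), so f16-K/q8-V at 64K should come out to roughly 0.72x the bytes of q8-K/q8-V at 128K. A back-of-envelope sketch (the layer/head dimensions here are placeholders for illustration, not Qwen's actual config):

```python
# Relative KV-cache size when changing cache dtype and context length.
# Assumes llama.cpp's q8_0 layout: 32 values + fp16 scale per block,
# i.e. 34 bytes / 32 elements ~= 1.0625 bytes per element.
BYTES_F16 = 2.0
BYTES_Q8_0 = 34 / 32

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_k, bytes_v):
    """Total bytes for the K and V caches across all layers."""
    per_token = n_layers * n_kv_heads * head_dim
    return n_ctx * per_token * (bytes_k + bytes_v)

# Hypothetical dims for illustration only -- not the real Qwen3.6 27B config.
L, H, D = 48, 8, 128

q8_128k  = kv_cache_bytes(131072, L, H, D, BYTES_Q8_0, BYTES_Q8_0)
f16k_64k = kv_cache_bytes(65536,  L, H, D, BYTES_F16,  BYTES_Q8_0)

print(f"q8 K/V @ 128K ctx:    {q8_128k / 2**30:.2f} GiB")
print(f"f16 K, q8 V @ 64K:    {f16k_64k / 2**30:.2f} GiB "
      f"({f16k_64k / q8_128k:.2f}x)")
```

The exact GiB numbers depend entirely on the assumed dims, but the 0.72x ratio only depends on the dtypes and the context halving. For the 64K/f16-K run you'd swap in `--ctx-size 65536` and `--cache-type-k f16`.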