Comment by zargon

10 hours ago

The Qwen3.5 series is a bit of an exception to the general rule here: it is remarkably KV-cache efficient. IIRC the max context (262k) fits in about 3 GB at Q8. I prefer to keep the cache at full precision, though.

I just tested this and have to make a correction: with llama.cpp, a 262144-token context (Q8 cache) used 8.7 GB of memory with Qwen3.6 27B. Still very impressive.

  • The MoE variants are more cache efficient. Here's Qwen3.6 35B A3B MoE with 256k (262144) context at full F16 (so no cache quality loss):

      llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
    

    The MXFP4-quantized variant from Unsloth just fits on my 5090 (32 GB VRAM) at 256k context.

    Meanwhile, here's Qwen3.6 27B:

      llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
    

    So 16 tokens per MiB for the 27B model vs about 51 tokens per MiB for the 35B MoE model.
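    Those tokens-per-MiB figures fall straight out of the `cells` and `size` values in the log lines above; a quick sketch:

    ```python
    # Context tokens per MiB of KV cache, from the llama_kv_cache log lines above.
    def tokens_per_mib(cells: int, size_mib: float) -> float:
        return cells / size_mib

    dense_27b = tokens_per_mib(49152, 3072.0)   # 27B dense: 16.0 tokens/MiB
    moe_35b = tokens_per_mib(262144, 5120.0)    # 35B MoE: 51.2 tokens/MiB

    print(f"27B dense: {dense_27b:.1f} tokens/MiB")
    print(f"35B MoE:   {moe_35b:.1f} tokens/MiB")
    ```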

    I went with the Q5 UD variant for the 27B, so I could just fit 48k context; it seems the Q4 UD variant would get me to 64k.

    That said, I haven't tried the Qwen3.6 35B MoE enough to know whether it can effectively use the full 256k context; that varies from model to model depending on training.
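    For what it's worth, both log lines are consistent with the usual F16 KV-cache formula, 2 (K and V) × layers × cells × kv_dim × 2 bytes. Back-solving from either log gives kv_dim = 512 (e.g. 4 KV heads × 128 head dim); that value is inferred from the logs, not a published spec, so treat it as an assumption. Note the logs also show only 10 and 16 cached layers respectively, which suggests the remaining layers use a different (e.g. sliding-window) attention cache not counted here.

    ```python
    # Rough KV-cache size estimate matching the llama.cpp logs above.
    # kv_dim (n_kv_heads * head_dim) = 512 is back-solved from the logs (an assumption).
    def kv_cache_mib(n_layers: int, n_ctx: int, kv_dim: int = 512,
                     bytes_per_elem: int = 2) -> float:
        # K and V each hold n_layers * n_ctx * kv_dim elements.
        return 2 * n_layers * n_ctx * kv_dim * bytes_per_elem / (1024 ** 2)

    print(kv_cache_mib(10, 262144))  # 5120.0 MiB, matches the 35B MoE log
    print(kv_cache_mib(16, 49152))   # 3072.0 MiB, matches the 27B log
    ```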