Comment by zargon

10 hours ago

The Qwen3.5 series is a bit of an exception to the general rule here: it is remarkably KV-cache efficient. IIRC the max context (262k) fits in about 3 GB at Q8. I prefer to keep the cache at full precision, though.

I just tested this and have to make a correction: with llama.cpp, a 262144-token context (Q8 cache) used 8.7 GB of memory with Qwen3.6 27B. Still very impressive.

  • The MoE variants are more cache efficient. Here's Qwen3.6 35B A3B MoE with 256k (262144) context at full F16 (so no cache quality loss):

      llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
    

    The MXFP4-quantized variant from Unsloth just fits on my 5090 (32 GB VRAM) at 256k context.

    Meanwhile, here's Qwen3.6 27B:

      llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
    

    So 16 tokens per MiB for the 27B model vs about 51 tokens per MiB for the 35B MoE model.
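    Those tokens-per-MiB figures fall straight out of the `cells` and `size` values in the log lines above; a quick sketch:

    ```python
    # Context tokens per MiB of KV cache, from the llama_kv_cache log lines above.
    def tokens_per_mib(cells: int, size_mib: float) -> float:
        return cells / size_mib

    dense_27b = tokens_per_mib(49152, 3072.0)   # 27B dense: 16.0 tokens/MiB
    moe_35b = tokens_per_mib(262144, 5120.0)    # 35B MoE: 51.2 tokens/MiB

    print(f"27B dense: {dense_27b:.1f} tokens/MiB")
    print(f"35B MoE:   {moe_35b:.1f} tokens/MiB")
    ```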

    I went with the Q5 UD variant for the 27B, so I could just fit 48k context; it seems the Q4 UD variant would get me to 64k.

    That said, I haven't tried the Qwen3.6 35B MoE enough to know whether it can effectively use the full 256k context; that varies from model to model depending on training.
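    For what it's worth, both log lines are consistent with the usual F16 KV-cache formula, 2 (K and V) × layers × cells × kv_dim × 2 bytes. Back-solving from either log gives kv_dim = 512 (e.g. 4 KV heads × 128 head dim); that value is inferred from the logs, not a published spec, so treat it as an assumption. Note the logs also show only 10 and 16 cached layers respectively, which suggests the remaining layers use a different (e.g. sliding-window) attention cache not counted here.

    ```python
    # Rough KV-cache size estimate matching the llama.cpp logs above.
    # kv_dim (n_kv_heads * head_dim) = 512 is back-solved from the logs (an assumption).
    def kv_cache_mib(n_layers: int, n_ctx: int, kv_dim: int = 512,
                     bytes_per_elem: int = 2) -> float:
        # K and V each hold n_layers * n_ctx * kv_dim elements.
        return 2 * n_layers * n_ctx * kv_dim * bytes_per_elem / (1024 ** 2)

    print(kv_cache_mib(10, 262144))  # 5120.0 MiB, matches the 35B MoE log
    print(kv_cache_mib(16, 49152))   # 3072.0 MiB, matches the 27B log
    ```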