Comment by trilogic

7 hours ago

Edit: Forgot to mention that it can process images, PDFs, and hundreds of other file types. It can even create presentations in code, Mermaid, SVG, Chart.js, etc. Here is a basic version of it: https://hugston.com/chat

7 comments


rspoerri 7 hours ago

How do you do 1M context with Qwen3.6 27B, which only supports 256k? And what hardware would you run that on? 2x 3090 is, AFAIK, currently at max 256k context.

nyrikki 6 hours ago
You can get all the Qwen 3.x models up to ~1 million tokens using YaRN with llama.cpp.[0]
Personally I am using `--no-context-shift` and feeding context back in on failure at the harness level.
I have 2x 1080 Ti + 1x Titan V running the full 262,144-token context with `-sm tensor` at 62.04 t/s, which isn't so bad.
I also have 1x 3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s, though with only 130k context. If you have a modular programming style, both work pretty well.
But play with YaRN if you really need it.
[0] https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
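
To make the YaRN suggestion above a bit more concrete, here is a rough sketch that derives the RoPE scale factor from the trained and target context lengths and assembles the matching llama.cpp arguments. The 1,048,576-token target (4 x 262,144) and the helper name `yarn_args` are assumptions for illustration only; check the flags against your own llama.cpp build and the linked Qwen docs before relying on them.

```python
def yarn_args(orig_ctx: int, target_ctx: int) -> list[str]:
    """Build llama.cpp-style context arguments (hypothetical helper)."""
    if target_ctx <= orig_ctx:
        # Within the trained window: no RoPE scaling needed.
        return ["--ctx-size", str(target_ctx)]
    # YaRN scale factor = target context / trained context.
    scale = target_ctx / orig_ctx
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(orig_ctx),
    ]


# Assumed numbers: 262,144 trained context, stretched 4x.
args = yarn_args(262_144, 1_048_576)
print(" ".join(args))
```

The printed arguments would be appended to an existing `llama-server` or `llama-cli` command line. Scaling only changes how positions are encoded; the larger KV cache still has to fit in memory.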
- Vaskivo 5 hours ago
  
  How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.
  Here's my setup:
  llama-server --port 9999 --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf --ctx-size 128000 --threads 12 --flash-attn on --device CUDA0 --jinja --gpu-layers 52 --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --spec-type draft-mtp --spec-draft-n-max 2
  (I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)
  
omneity 6 hours ago

You can extend the context window beyond the model's maximum trained context using RoPE scaling[0], which will require more VRAM.
You can also fit a longer context window in the same VRAM by quantizing the KV cache with FP8 (double the context relative to a 16-bit cache) or TurboQuant (more than double)[1].
0: https://medium.com/@leannetan/extending-context-length-with-...
1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...
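
A back-of-the-envelope sketch of why halving the bytes per cached value doubles the context that fits in a fixed VRAM budget. The layer count, KV head count, head dimension, and 8 GiB budget below are hypothetical placeholders, not the real Qwen3.6 27B architecture, and the formula assumes every layer is a standard attention layer with its own K and V cache. TurboQuant is not modeled here.

```python
# Bytes per cached element for a few KV cache formats.
# q8_0 stores 32 int8 values plus one fp16 scale per block (34 bytes / 32 values).
BYTES_PER_ELEM = {"f16": 2.0, "fp8": 1.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, fmt):
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * BYTES_PER_ELEM[fmt]


def max_ctx(budget_bytes, n_layers, n_kv_heads, head_dim, fmt):
    # Largest whole number of tokens whose KV cache fits in the budget.
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, fmt)
    return int(budget_bytes // per_token)


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128)
budget = 8 * 1024**3  # 8 GiB reserved for the KV cache (assumption)

for fmt in BYTES_PER_ELEM:
    print(fmt, max_ctx(budget, fmt=fmt, **SHAPE))
```

With these assumed numbers, the FP8 cache fits exactly twice as many tokens as the 16-bit one, while `q8_0` lands slightly under double because of the per-block scale.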
trilogic 6 hours ago

We managed to increase the context for any LLM that has been converted to GGUF. Here are the experimental tests: https://www.reddit.com/r/Hugston/