
Comment by trilogic

7 hours ago

Edit: Forgot to mention that it can process images, PDFs, and hundreds of other file types. It can even create presentations in code, or as Mermaid, SVG, Chart.js, etc. Here's a basic version of it: https://hugston.com/chat

How do you get 1M context with Qwen3.6 27B, which only supports 256k? And what hardware would you run that on? As far as I know, 2x 3090 currently tops out at 256k context.

  • You can get all the Qwen 3.x models up to ~1 million tokens using YaRN with llama.cpp.[0]

    Personally, I am using `--no-context-shift` and feeding context back in on failure at the harness level.

    I have 2x 1080 Ti + 1x Titan V running the full 262,144-token context with `-sm tensor` at 62.04 t/s, which isn't so bad.

    I also have a single 3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s, but with only 130k context. If you have a modular programming style, both work pretty well.

    But play with YaRN if you really need it.

    [0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
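
    As a rough sketch of the YaRN setup mentioned above: assuming a native context of 262,144 tokens and a ~1M target, the scale factor works out to 4. The model path below is a placeholder, and the exact values are assumptions to check against the linked Qwen docs for your model. The script only prints the command so it can be inspected before running.

    ```shell
    #!/usr/bin/env bash
    # Sketch only: derive YaRN settings for stretching a native context window.
    NATIVE_CTX=262144     # native context discussed in the thread (256k)
    TARGET_CTX=1048576    # ~1M tokens, the target being asked about
    ROPE_SCALE=$(( TARGET_CTX / NATIVE_CTX ))   # integer scale factor (4 here)

    # MODEL is a hypothetical placeholder path, not taken from the thread.
    MODEL="${MODEL:-/path/to/model.gguf}"

    # Print the command instead of executing it.
    echo llama-server \
      --model "$MODEL" \
      --ctx-size "$TARGET_CTX" \
      --rope-scaling yarn \
      --rope-scale "$ROPE_SCALE" \
      --yarn-orig-ctx "$NATIVE_CTX" \
      --no-context-shift
    ```

    Whether a 1M-token KV cache actually fits depends on your VRAM and cache quantization, so treat the target as an upper bound rather than something this will give you for free.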

    • How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.

      Here's my setup:

        llama-server \
          --port 9999 \
          --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf \
          --ctx-size 128000 \
          --threads 12 \
          --flash-attn on \
          --device CUDA0 \
          --jinja \
          --gpu-layers 52 \
          --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf \
          --cache-type-k q8_0 \
          --cache-type-v q8_0 \
          --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 \
          --spec-type draft-mtp --spec-draft-n-max 2
      

      (I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)
