Comment by nyrikki
5 hours ago
You can get all the Qwen 3.x models up to ~1 million tokens of context using YaRN with llama.cpp.[0]
Personally I am using `--no-context-shift` and feeding context back in on failure at the harness level.
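A rough sketch of what "feeding context back in on failure" could look like in a harness. This is not my actual harness: `send`, `ContextOverflow`, `trim_oldest` and `complete_with_retry` are all hypothetical names, and halving the history on each retry is just one assumed strategy. `send` stands in for whatever call hits your llama.cpp server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ContextOverflow(Exception):
    """Hypothetical error: raised by `send` when the request no longer fits the context."""


def trim_oldest(messages: List[Message], keep_last: int) -> List[Message]:
    """Keep every system message plus the `keep_last` most recent other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-keep_last:] if keep_last > 0 else []
    return system + recent


def complete_with_retry(
    send: Callable[[List[Message]], str],
    messages: List[Message],
    max_attempts: int = 3,
) -> str:
    """Call `send`; on overflow, halve the retained history and feed it back in."""
    keep = len([m for m in messages if m["role"] != "system"])
    current = list(messages)
    for _ in range(max_attempts):
        try:
            return send(current)
        except ContextOverflow:
            # Assumed policy: keep half as many recent messages on each retry.
            keep = max(1, keep // 2)
            current = trim_oldest(messages, keep)
    raise ContextOverflow("request still too long after trimming")
```

The point of doing it this way is that the harness decides what gets dropped, instead of the server silently shifting the window.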
I have 2x1080ti + 1xTitanV running a full 262,144-token context with `-sm tensor` at 62.04 t/s, which isn't so bad.
I also have a 1x3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s, though with only 130k context. If you have a modular programming style, both work pretty well.
But play with YaRN if you really need it.
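For the arithmetic: going from a 262,144-token native context to ~1 million is a scale factor of 4 (1,048,576 / 262,144). A small sketch that works out the factor and builds the llama.cpp arguments. `yarn_args` is a hypothetical helper, and the native and target sizes are assumptions taken from the numbers above, so check the linked docs for your specific model.

```python
from typing import List


def yarn_args(native_ctx: int, target_ctx: int) -> List[str]:
    """Build llama.cpp context arguments, adding YaRN scaling only when needed."""
    if target_ctx <= native_ctx:
        # Fits in the native window, so no rope scaling is required.
        return ["-c", str(target_ctx)]
    scale = target_ctx / native_ctx
    return [
        "-c", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",
        "--yarn-orig-ctx", str(native_ctx),
    ]


# Assumed sizes: 262,144 native, 1,048,576 target.
print(" ".join(yarn_args(262_144, 1_048_576)))
```

Only turn it on when you actually need the longer window; below the native size the helper leaves scaling off.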
[0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.
Here's my setup:
(I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)
(Note: UPDATED config)
Yeah, if you are using the CPU it may slow down quickly.
This may be a bit huge and overcomplicated. On this host I am running it on an AMD Ryzen 7 5700G so that I can use the APU for display and dedicate the 3090 to the model.
I am just building the container with:
And here are the logs from a 'make me a flappy bird program in python' webui prompt.
I am down to ~25.54 t/s with a 95% full context.
That config looked too complicated. Getting rid of `--prio 3` and `--poll 100`, setting draft-n-max to the now-recommended values, etc. kicked it up to 61 t/s.
I think those flags were all about some earlier crashes.