← Back to context

Comment by hadlock

4 hours ago

How many t/s output are you getting at Q4_K_M with 200k context on your Strix Halo if you ask it to add a new feature to a codebase.

Qwen 3.6 27B, and other dense models, as opposed to MoE models do NOT scale well. Like I said in my original post, for 27B usage specifically, I'd take a dGPU with 32GB of VRAM over Strix Halo. I also don't usually benchmark out to 200k, my typical depths are 0, 16k, 32k, 64k, 128k. That said, with Qwen 3.5 122B A10B, I am still getting 70 tok/s PP speed and 20 tok/s TG speed at 128k depth, and with Nemotron 3 Super 120B A10B, 160 tok/s PP speed and 16 tok/s TG speed at 128k depth. With smaller MoE models, I did bench Qwen 3.6 35B A3B at 214 tok/s PP at 128k and 34.5 tok/s TG at 131k.

Because dense models degrade so severely, I rarely bench them past 32k-64k, however, I did find a Gemma4 31B bench I did - down to 22 tok/s PP speed and 6 tok/s TG speed at 128k.

Nemotron models specifically, because of their Mamba2 hybrid SSM architecture, scale exceptionally well, and I have benchmarks for 200k, 300k, 400k, 500k, and 600k for Nemotron 3 Super. I will use depth: PP512/TG128 for simplicity.

100k: 206/16 200k: 136/16 300k: 95/14 400k: 61/13 500k: 45/13 600k: 36/12