Comment by aunty_helen
1 day ago
I was doing some benchmarking last night on two 3090s. The system’s a bit old, but I’m seeing 11 tok/s on 27B and 15 tok/s on the 35B MoE.
The limited context is problematic. I’m not exactly sure how much it has available, but Hermes was hit and miss on a prospecting job.
It does seem to be doing useful work, but it’s not at the quality you get from API calls.
> The system’s a bit old, but I’m seeing 11 tok/s on 27B and 15 tok/s on the 35B MoE
If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.
With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.6-27B.)
I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
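For a rough sense of why bf16 would hurt, here is a back-of-envelope sketch, not a measurement. It assumes the quant averages about 4.8 bits per weight (my guess, not a published figure), uses the RTX 3090's spec memory bandwidth of 936 GB/s, and treats decoding of a dense model as reading every weight once per token. KV cache and activations are ignored, and the function names are just for illustration.

```python
# Back-of-envelope only: weights-only size and a bandwidth-bound decode ceiling.
GB = 1e9

def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB

def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s=936.0):
    """Upper bound on tok/s if every weight is read once per generated token."""
    return bandwidth_gb_s / weights_gb

# 4.8 bits/weight is an assumed average for a 4-bit K-quant, not a measured value.
for label, bits in [("bf16", 16.0), ("~4.8-bit quant (assumed)", 4.8)]:
    size = weight_gb(27, bits)
    print(f"27B {label}: {size:.1f} GB weights, <= {decode_ceiling_tok_s(size):.0f} tok/s")
```

Under those assumptions, a 27B dense model in bf16 is about 54 GB of weights, which doesn't fit in 48GB of VRAM at all, while the 4-bit quant is about 16 GB and fits on one card with room left for context.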
Good to know. Might be worth upgrading the motherboard then; it’s limited in PCIe speed.