Comment by magicalhippo
14 hours ago
For reference in case it's interesting to someone, a 5090 on Windows 11 with CUDA 13.1
| model                 |      size |  params | backend | ngl |   test |              t/s |
| --------------------- | --------: | ------: | ------- | --: | -----: | ---------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA    | 999 | pp2048 | 10179.12 ± 52.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA    | 999 |  tg128 |    326.82 ± 7.82 |
| qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | CUDA    | 999 | pp2048 |  3129.92 ± 5.12 |
| qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | CUDA    | 999 |  tg128 |    53.45 ± 0.15 |
build: 9d34231bb (8929)
gpt-oss-20b-MXFP4.gguf
Qwen3.6-27B-UD-Q6_K_XL.gguf
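For anyone wanting to reproduce this: the table looks like standard llama-bench output, so an invocation along these lines should regenerate it (a sketch, assuming the models sit next to the binary; pp2048 / tg128 in the table correspond to the `-p` and `-n` values):

```shell
# Sketch of llama-bench runs matching the table above (model paths are assumptions).
# -ngl 999 offloads all layers to the GPU; -p 2048 runs the prompt-processing
# test (pp2048) and -n 128 runs the token-generation test (tg128).
./llama-bench -m gpt-oss-20b-MXFP4.gguf -ngl 999 -p 2048 -n 128
./llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128
```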
Using the MXFP4 quant of GPT-OSS because the model was trained quantization-aware for that format, and MXFP4 is natively supported on the 50xx series.
You can get 120 TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 using AutoRound with MTP enabled. That's faster than Sonnet API calls. A 5090 gets maybe 100 TPS with MTP.