Comment by johndough
4 hours ago
Getting ~44 tok/s (degrading to ~40 tok/s at 32k context) on a 24GB RTX 3090 (llama.cpp build 8884, same llama-batched-bench call):
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 0.684 | 1462.61 | 2.869 | 44.61 | 3.553 | 317.47 |
| 2000 | 128 | 1 | 2128 | 1.390 | 1438.84 | 2.868 | 44.64 | 4.258 | 499.80 |
| 4000 | 128 | 1 | 4128 | 2.791 | 1433.18 | 2.886 | 44.35 | 5.677 | 727.11 |
| 8000 | 128 | 1 | 8128 | 5.646 | 1416.98 | 2.922 | 43.80 | 8.568 | 948.65 |
| 16000 | 128 | 1 | 16128 | 11.851 | 1350.10 | 3.007 | 42.57 | 14.857 | 1085.51 |
| 32000 | 128 | 1 | 32128 | 25.855 | 1237.66 | 3.168 | 40.40 | 29.024 | 1106.96 |
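The column headers come straight from llama-batched-bench's output. For anyone wanting to reproduce this, a call of roughly the following shape produces the sweep above; the parent's exact invocation isn't quoted here, so the model path and the `-c`/`-b`/`-ub`/`-ngl` values below are placeholders, and only the `-npp`/`-ntg`/`-npl` sweep is read off the table:

```sh
# Sketch of a llama-batched-bench invocation matching the table above.
# Model path and -c/-b/-ub/-ngl are assumptions; only -npp/-ntg/-npl
# are taken from the PP/TG/B columns.
./llama-batched-bench \
    -m model.gguf \
    -ngl 99 \
    -c 32768 \
    -b 2048 -ub 512 \
    -npp 1000,2000,4000,8000,16000,32000 \
    -ntg 128 \
    -npl 1
# Column legend: PP = prompt tokens, TG = generated tokens, B = parallel
# sequences, N_KV = PP + B*TG (total KV-cache tokens), S_PP/S_TG = prompt
# and generation speed (t/s), T/S = total wall time and overall throughput.
```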
Edit: the model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to a 51968-token context), which seems more robust.