Comment by arjie
3 hours ago
Vouched for your comment. Very cool. What are you running on to get 190 tok/s? I get 400 tok/s at c=4, but my c=1 is slower than yours.
I am using the `voipmonitor/vllm:lucifer` Docker image from the RTX6K Discord community, discussed at the same link the other commenter posted. It is based on this PR: https://github.com/vllm-project/vllm/pull/43477
Not OP, but I am seeing up to 260 tokens/second output at c=1 with the recipe at https://github.com/local-inference-lab/rtx6kpro/blob/master/... using 4x 6k cards.
There may be a way to get the 2-bit quantized version running even faster on a pair of them.