Comment by arjie

3 hours ago

Vouched for your comment. Very cool. What are you running on to get 190 tok/s? I get 400 tok/s at c=4, but my c=1 is slower than yours.

2 comments

mtone 2 minutes ago

I am using the `voipmonitor/vllm:lucifer` Docker image from the RTX6K Discord community, discussed at the same link the other commenter posted. It is based on this PR: https://github.com/vllm-project/vllm/pull/43477

CamperBob2 21 minutes ago

Not OP, but I am seeing up to 260 tokens/second output at c=1 with the recipe at https://github.com/local-inference-lab/rtx6kpro/blob/master/... using 4x 6k cards.

There may be a way to get the 2-bit quantized version running even faster on a pair of them.