Comment by zxexz
2 months ago
I’m serving it on 16 H100s (2 nodes). I get 50-80 tok/s per request, and in aggregate I’ve seen several thousand. TTFT is pretty stable. It’s faster than any cloud service we can use.
H200s are pretty easy to get now. If you switched, I'm guessing you'd get a nice bump because the NCCL all-reduce on the big MLPs wouldn't have to cross InfiniBand.
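For intuition, a rough back-of-envelope on why that all-reduce hurts across nodes; the hidden size, layer count, and link speeds below are illustrative assumptions, not figures from this thread:

```python
# Back-of-envelope: per-token all-reduce traffic under tensor parallelism.
# All model dimensions here are assumed for illustration.

hidden_size = 8192          # assumed hidden dimension
num_layers = 80             # assumed decoder layer count
bytes_per_elem = 2          # bf16 activations
allreduces_per_layer = 2    # one after attention, one after the MLP

# A ring all-reduce moves roughly 2x the payload per rank.
bytes_per_token = (num_layers * allreduces_per_layer
                   * hidden_size * bytes_per_elem * 2)

print(f"~{bytes_per_token / 1e6:.1f} MB of all-reduce traffic per decoded token")
# At ~50 GB/s per direction for 400 Gb/s InfiniBand vs ~900 GB/s NVLink,
# the same traffic takes an order of magnitude longer off-node.
```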
You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.
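As a hedged illustration of the MFU point (the model size, aggregate throughput, and peak-FLOPs figures below are assumptions, not numbers from the thread):

```python
# Rough MFU estimate for a setup like the one described above.
# Decoding costs roughly 2 FLOPs per parameter per generated token.

params = 405e9              # assumed parameter count
tok_per_s = 3000            # "several thousand" aggregate tokens/s at peak
num_gpus = 16
peak_flops = 989e12         # H100 SXM bf16 dense peak, per GPU

achieved = 2 * params * tok_per_s
mfu = achieved / (num_gpus * peak_flops)
print(f"MFU ≈ {mfu:.1%}")   # ~15% at peak here; the small average batch
                            # sizes described below would sit well under that
```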
Yes, a very tiny batch size on average. We haven't optimized for MFU. This is optimized for a varying number (~1-60ish) of active requests while minimizing latency (both time to first token and time from the final prompt token to the last output token), given short-to-medium known "prompts" and short structured responses, with very little in the way of shared prefixes across concurrent prompts.
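A minimal sketch of measuring both latencies against an OpenAI-compatible endpoint; the base_url, port, and model name are placeholders, not this deployment's actual values:

```python
# Time TTFT and time-to-last-token for one streaming request.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

start = time.perf_counter()
first = None
stream = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Return a short JSON summary."}],
    stream=True,
)
for chunk in stream:
    if first is None and chunk.choices and chunk.choices[0].delta.content:
        first = time.perf_counter()   # first token arrived
last = time.perf_counter()            # stream exhausted: last token

print(f"TTFT: {first - start:.3f}s, TTLT: {last - start:.3f}s")
```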
You could do it on one node of 8xMI300x and cut your costs down.
Using vLLM?
Oh, SGLang. Had to make a couple of modifications; I forget what they were, nothing crazy. Lots of extra firmware, driver, and system config work too.
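For reference, a hedged sketch of what a short structured-response request to an SGLang OpenAI-compatible server can look like; the port, model name, and schema are placeholders, and json_schema support depends on the SGLang version:

```python
# Ask for a response constrained to a (made-up) JSON schema.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

schema = {
    "type": "object",
    "properties": {"label": {"type": "string"}, "score": {"type": "number"}},
    "required": ["label", "score"],
}

resp = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Classify this ticket: 'login fails'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))
```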