Comment by zxexz
2 months ago
I’m serving it on 16 H100s (2 nodes). I get 50-80 tok/s per request, and in aggregate I’ve seen several thousand. TTFT is pretty stable. It’s faster than any cloud service we can use.
H200s are pretty easy to get now. If you switched, I'm guessing you'd get a nice bump because the NCCL all-reduce on the big MLPs wouldn't have to cross InfiniBand.
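For intuition, a rough back-of-envelope on why that all-reduce hurts across nodes; the hidden size, layer count, and link speeds below are illustrative assumptions, not figures from this thread:

```python
# Back-of-envelope: per-token all-reduce traffic under tensor parallelism.
# All model dimensions here are assumed for illustration.

hidden_size = 8192          # assumed hidden dimension
num_layers = 80             # assumed decoder layer count
bytes_per_elem = 2          # bf16 activations
allreduces_per_layer = 2    # one after attention, one after the MLP

# A ring all-reduce moves roughly 2x the payload per rank.
bytes_per_token = (num_layers * allreduces_per_layer
                   * hidden_size * bytes_per_elem * 2)

print(f"~{bytes_per_token / 1e6:.1f} MB of all-reduce traffic per decoded token")
# At ~50 GB/s per direction for 400 Gb/s InfiniBand vs ~900 GB/s NVLink,
# the same traffic takes an order of magnitude longer off-node.
```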
You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.
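As a hedged illustration of the MFU point (the model size, aggregate throughput, and peak-FLOPs figures below are assumptions, not numbers from the thread):

```python
# Rough MFU estimate for a setup like the one described above.
# Decoding costs roughly 2 FLOPs per parameter per generated token.

params = 405e9              # assumed parameter count
tok_per_s = 3000            # "several thousand" aggregate tokens/s at peak
num_gpus = 16
peak_flops = 989e12         # H100 SXM bf16 dense peak, per GPU

achieved = 2 * params * tok_per_s
mfu = achieved / (num_gpus * peak_flops)
print(f"MFU ≈ {mfu:.1%}")   # ~15% at peak here; the small average batch
                            # sizes described below would sit well under that
```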
Yes, a very tiny batch size on average. We haven't optimized for MFU. This is optimized for a varying number (~1-60ish) of active requests while minimizing latency (both time to first token and time from the final prompt token to the last output token), given short-to-medium known "prompts" and short structured responses, with very little in the way of shared prefixes across concurrent prompts.
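A minimal sketch of measuring both latencies against an OpenAI-compatible endpoint; the base_url, port, and model name are placeholders, not this deployment's actual values:

```python
# Time TTFT and time-to-last-token for one streaming request.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

start = time.perf_counter()
first = None
stream = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Return a short JSON summary."}],
    stream=True,
)
for chunk in stream:
    if first is None and chunk.choices and chunk.choices[0].delta.content:
        first = time.perf_counter()   # first token arrived
last = time.perf_counter()            # stream exhausted: last token

print(f"TTFT: {first - start:.3f}s, TTLT: {last - start:.3f}s")
```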
You could do it on one node of 8xMI300x and cut your costs down.
Using vLLM?
Oh, SGLang. Had to make a couple of modifications; I forget what they were, nothing crazy. Lots of extra firmware, driver, and system config work too.
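For reference, a hedged sketch of what a short structured-response request to an SGLang OpenAI-compatible server can look like; the port, model name, and schema are placeholders, and json_schema support depends on the SGLang version:

```python
# Ask for a response constrained to a (made-up) JSON schema.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

schema = {
    "type": "object",
    "properties": {"label": {"type": "string"}, "score": {"type": "number"}},
    "required": ["label", "score"],
}

resp = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Classify this ticket: 'login fails'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))
```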