Comment by ozgune
6 days ago
In March, vLLM picked up some of the improvements from the DeepSeek paper. With these, vLLM v0.7.3's DeepSeek performance jumped to more than 3x what it was before [1].
What's exciting is that there's still so much room for improvement. With vLLM under high concurrency, we benchmark around 5K total tokens/s on the ShareGPT dataset and 12K total tokens/s on random 2000/100 (2,000 input / 100 output tokens).
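For anyone who wants a rough sense of what "total tokens/s" means here, below is a minimal offline throughput sketch using vLLM's Python API. It is not the serving benchmark our numbers come from (those use a serving-style benchmark like benchmark_serving.py with ShareGPT / random datasets under concurrency), and the model name and prompt construction are just placeholders:

    import time
    from vllm import LLM, SamplingParams

    # Placeholder model: DeepSeek-R1 itself needs a multi-GPU setup.
    llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
    params = SamplingParams(temperature=0.0, max_tokens=100)

    # Loosely mirror the "random 2000/100" shape: long prompts, 100 output tokens.
    prompts = [" ".join(["hello"] * 2000) for _ in range(256)]

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    # Count prompt + generated tokens to get a total tokens/s figure.
    total_tokens = sum(
        len(o.prompt_token_ids) + len(o.outputs[0].token_ids) for o in outputs
    )
    print(f"{total_tokens / elapsed:.0f} total tokens/s")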
The DeepSeek-V3/R1 Inference System Overview [2] states: "Each H800 node delivers an average throughput of 73.7k tokens/s input (including cache hits) during prefilling or 14.8k tokens/s output during decoding."
Yes, DeepSeek deploys a different inference architecture. But this goes to show just how much room there is for improvement. Looking forward to more open source!
[1] https://developers.redhat.com/articles/2025/03/19/how-we-opt...
[2] https://github.com/deepseek-ai/open-infra-index/blob/main/20...