Comment by mycelia
1 day ago
Hey HN! I’m Seiji Eicher from Anyscale, one of the authors of this post :) Feel free to ask questions here.
Are you using 16-bit for inference? How many tokens/second do you get with 8-bit?
Given that SOTA models now use 4-bit inference, can you estimate 4-bit + Blackwell performance?
Hi! This benchmarking was done with DeepSeek-V3's published FP8 weights, and Blackwell performance is still being optimized. SGLang hit 14k tokens/s per B200 though; pretty cool writeup here: https://lmsys.org/blog/2025-09-25-gb200-part-2/
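For reference, serving the published FP8 checkpoint with vLLM looks roughly like this. This is a minimal sketch; the parallelism settings and sampling parameters are illustrative assumptions, not the benchmark's actual configuration:

```python
# Minimal sketch: serving DeepSeek-V3's published FP8 weights with
# vLLM's offline API. tensor_parallel_size=8 assumes a single 8-GPU
# node and is illustrative, not the benchmark configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # checkpoint ships FP8 weights
    tensor_parallel_size=8,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain expert parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```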
Great work! What optimizations are you most excited about for 2026?
Lots of cool stuff coming up! As a Ray developer, I focus more on the orchestration layer, so I'm excited about things like Elastic Expert Parallelism, post-training enhancements like colocated trainers/engines, and deploying DSV4 (rumor is the architecture will be complex). The vLLM roadmap is here for reference: http://roadmap.vllm.ai/
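To make the colocation idea concrete, here's a toy Ray sketch. The actor names and the fractional-GPU split are my own illustrative assumptions, not vLLM's or Ray's actual post-training API:

```python
# Toy sketch: colocating a trainer and an inference engine on the same
# GPU with Ray. Requesting 0.5 GPUs per actor lets Ray schedule both
# onto one device; all names here are hypothetical.
import ray

ray.init()

@ray.remote(num_gpus=0.5)
class Trainer:
    def step(self, batch):
        # ... run a training step, return updated weights/metadata ...
        return {"version": 1}

@ray.remote(num_gpus=0.5)
class Engine:
    def __init__(self):
        self.weights = None

    def sync_weights(self, weights):
        # In a real setup this would be a GPU-to-GPU transfer
        # (e.g. NCCL or CUDA IPC), not a plain object copy.
        self.weights = weights

    def generate(self, prompt):
        return f"sample for {prompt!r} @ weights {self.weights}"

trainer, engine = Trainer.remote(), Engine.remote()
weights = ray.get(trainer.step.remote(batch=None))
ray.get(engine.sync_weights.remote(weights))
print(ray.get(engine.generate.remote("hello")))
```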
Do you use agentic AI yet for this type of optimization work or no?
For my work personally, agentic AI usage is pretty standard SWE fare (Cursor/CC). Within the engine itself, optimizations are often centered on things like increasing communication/compute overlap (in vLLM this is called Dual-Batch Overlap).
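The general overlap pattern looks something like this toy PyTorch sketch; it's much simpler than vLLM's actual Dual-Batch Overlap implementation, but the idea is the same: kick off communication for one micro-batch asynchronously, then compute on the other while it's in flight:

```python
# Toy illustration of communication/compute overlap across two
# micro-batches (not vLLM's DBO implementation). Assumes
# torch.distributed is already initialized with an NCCL backend.
import torch
import torch.distributed as dist

def overlapped_step(mb_a: torch.Tensor, mb_b: torch.Tensor, weight: torch.Tensor):
    # Launch the all-reduce for micro-batch A without blocking...
    work = dist.all_reduce(mb_a, async_op=True)
    # ...and compute on micro-batch B while A's communication is in flight.
    out_b = mb_b @ weight
    work.wait()  # A's all-reduce has finished; now compute on it too.
    out_a = mb_a @ weight
    return out_a, out_b
```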
There are probably more interesting, easily verifiable agent loops you could try for kernel optimization. At this point, though, the best kernels are still written by hand. Ex: the DeepEP kernels https://github.com/deepseek-ai/DeepEP