Comment by kingstnap
1 day ago
Impressive performance work. It's interesting that you still see 40+% perf gains like this.
Makes you think the cost of a fixed level of "intelligence" will keep dropping.
vLLM needs to perform operations similar to those of an operating system. If you write an operating system in Python, you will have scope for many 40% improvements all over the place, and in the end it won't be Python anymore, at least under the hood.
It's not about the Python at all. These optimizations happen at a completely different level: the level of the chip and/or hardware platform, and finding ways to utilize it to the maximum by exploiting the intrinsic details of its limitations.
Absolutely. LLM inference is still a greenfield — things like overlap scheduling and JIT CUDA kernels are very recent. We’re just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.
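To make the "overlap scheduling" point concrete, here is a minimal sketch of the idea, assuming a PyTorch-style asynchronous kernel launch model. The `schedule_next_batch` function is a hypothetical stand-in for the real scheduler, not vLLM's actual API; the point is only that host-side scheduling work can run while the GPU is still busy with the previous step.

```python
import torch

def schedule_next_batch(step):
    # Hypothetical stand-in for CPU-side scheduling work (batching, KV-block
    # bookkeeping, sampling params); here it just builds a dummy input tensor.
    return torch.randn(8, 4096, device="cuda")

weights = torch.randn(4096, 4096, device="cuda")  # dummy model weights

batch = schedule_next_batch(0)
for step in range(1, 4):
    out = batch @ weights                    # async launch: GPU starts step N
    next_batch = schedule_next_batch(step)   # CPU overlaps: prepares step N+1
    torch.cuda.synchronize()                 # block on the GPU only when needed
    batch = next_batch
```

Because kernel launches return to the host immediately, the scheduling of step N+1 is hidden behind the GPU time of step N instead of being added to it; that is the kind of systems-level win (rather than a Python-vs-C one) being discussed above.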