Comment by himata4113
7 hours ago
There's a limit to how much you can "scale" this process, it's linear, but if we did napkin math based on vllm parallel batched streams only lose around ~50% performance compared to single-stream output so doesn't explain the ridicioulusly fast numbers here.
I wish google just came out and told us how large their flash model is, because if it's as big or smaller than gpt-5.4-nano that's the real headline here.
No comments yet
Contribute on Hacker News ↗