Comment by fancyfredbot

7 days ago

Have you looked at what happens to tokens per second when you increase batch size? The cost of serving 128 queries at once is not 128x the cost of serving one query.

This. The main trick, outside of just bigger hardware, is smart batching. E.g. if one user asks why the sky is blue and another asks what to make for dinner, both queries go through the same transformer layers and the same model weights, so they can be answered concurrently for very little extra GPU time (rough sketch below). There are also ways to continuously batch requests together so they don't have to be issued at the same time.
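
A toy illustration of why this works, assuming nothing beyond PyTorch (the layer and sizes below are stand-ins, not any real serving stack): 128 hidden states multiply against the same weight matrix as a single one, so the hardware does one bigger matmul instead of 128 small ones, and the weights only get read from memory once.

```python
import time
import torch

d_model = 4096
layer = torch.nn.Linear(d_model, d_model)  # stand-in for one transformer layer's weights

one_query = torch.randn(1, d_model)        # hidden state for a single prompt
batch     = torch.randn(128, d_model)      # 128 prompts stacked along the batch dim

with torch.no_grad():
    layer(one_query)                       # warmup so timings aren't skewed by init
    t0 = time.perf_counter()
    layer(one_query)                       # 1 query: cost dominated by reading the weights
    t1 = time.perf_counter()
    layer(batch)                           # 128 queries: same weights, one bigger matmul
    t2 = time.perf_counter()

print(f"  1 query  : {t1 - t0:.6f}s")
print(f"128 queries: {t2 - t1:.6f}s  # far less than 128x the single-query time")
```

Continuous batching (what e.g. vLLM does) extends this idea into the decode loop: finished sequences drop out of the batch at each token step and newly arrived requests slot in, so the GPU stays full without everyone having to submit at the same moment.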