Comment by imtringued
2 months ago
Attention doesn't get batched, and the runtime of attention for a given user's token depends on the total context length. Hence even in the ideal scenario where you get a dedicated attention-calculating GPU, the MLP-calculating GPU doing the batching will have to wait for the slowest user.
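A toy cost model of this (my own illustration with made-up numbers, not anyone's serving code): per-token attention time grows with context length, the batched MLP cost is roughly flat per step, and the step only finishes once the longest-context attention is done.

    # Hypothetical constants, purely for illustration.
    ATTN_NS_PER_CTX_TOKEN = 50    # attention cost scales with KV-cache length
    MLP_NS_PER_STEP = 200_000     # batched MLP cost, roughly constant per step

    def decode_step_ns(context_lengths):
        # Every request advances one token per step, but the step is
        # gated by the slowest (longest-context) attention computation.
        slowest_attention = max(n * ATTN_NS_PER_CTX_TOKEN for n in context_lengths)
        return slowest_attention + MLP_NS_PER_STEP

    print(decode_step_ns([500, 800, 1_000]))      # all short contexts: cheap step
    print(decode_step_ns([500, 800, 200_000]))    # one huge context drags everyone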
In the worst-case scenario you are sharing a single attention-calculating GPU with someone who has a super long context window; that user will hog most of the GPU's memory bandwidth even though you are both generating the same number of tokens.
This means that in the distributed setting you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts (a rough sketch of that bucketing follows).
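Sketch of what that bucketing might look like (bucket boundaries and names are my own, purely illustrative): route each request to a pool keyed by its context-length bucket and only batch requests within the same pool.

    from collections import defaultdict

    # Hypothetical bucket boundaries, in context tokens.
    BUCKETS = [4_096, 32_768, 128_000]

    def bucket_for(context_len):
        for i, upper in enumerate(BUCKETS):
            if context_len <= upper:
                return i
        return len(BUCKETS)  # overflow bucket for the longest contexts

    def group_requests(requests):
        # requests: list of (request_id, context_len) pairs
        pools = defaultdict(list)
        for rid, ctx in requests:
            pools[bucket_for(ctx)].append(rid)
        return pools  # each pool would get its own attention/MLP GPU setup

    print(group_requests([("a", 900), ("b", 150_000), ("c", 20_000)]))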