Comment by aziis98

7 days ago

I think this article might be interesting:

https://www.seangoedecke.com/inference-batching-and-deepseek...

Here is an example of what happens:

> The only way to do fast inference here is to pipeline those layers by having one GPU handle the first ten layers, another handle the next ten, and so on. Otherwise you just won’t be able to fit all the weights in a single GPU’s memory, so you’ll spend a ton of time swapping weights in and out of memory and it’ll end up being really slow. During inference, each token (typically in a “micro batch” of a few tens of tokens each) passes sequentially through that pipeline of GPUs
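To make the quoted idea concrete, here is a minimal sketch of pipelined inference with micro-batches. Everything in it (the `Stage` class, `run_pipeline`, the 4-GPU / 10-layers-per-stage split, the micro-batch sizes) is invented for illustration and not taken from the article or from DeepSeek's actual setup.

```python
# Illustrative sketch only: stage/layer counts and names are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One GPU holding a contiguous slice of the model's layers."""
    gpu_id: int
    layers: List[str]  # e.g. ["layer_0", ..., "layer_9"]

    def forward(self, micro_batch: List[int]) -> List[int]:
        # Stand-in for running this stage's layers on its GPU and
        # handing the activations to the next stage.
        print(f"GPU {self.gpu_id}: {len(self.layers)} layers "
              f"on micro-batch of {len(micro_batch)} tokens")
        return micro_batch


def run_pipeline(stages: List[Stage], micro_batches: List[List[int]]) -> None:
    # Each micro-batch flows through the stages in order. In a real system,
    # while micro-batch k is on stage 0, micro-batch k-1 is already on
    # stage 1, and so on, so all GPUs stay busy instead of idle.
    for mb in micro_batches:
        activations = mb
        for stage in stages:
            activations = stage.forward(activations)


if __name__ == "__main__":
    # 40 layers split into 4 stages of 10 layers each (hypothetical split).
    stages = [
        Stage(gpu_id=i, layers=[f"layer_{10 * i + j}" for j in range(10)])
        for i in range(4)
    ]
    # Three micro-batches of a few tens of tokens each.
    micro_batches = [list(range(32)) for _ in range(3)]
    run_pipeline(stages, micro_batches)
```

The point of the micro-batching is in the comment inside `run_pipeline`: with several small batches in flight, each GPU in the pipeline can be working on a different micro-batch at the same time, which is what keeps the whole pipeline utilized.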