Comment by mquander

7 days ago

I'm pretty much an AI layperson, but my basic understanding of how LLMs usually run on my box or yours is:

1. You load all the weights of the model into GPU VRAM, plus the context.

2. You construct a data structure called the "KV cache" representing the context, and it stays resident in VRAM alongside the weights so you don't have to recompute it for every token.

3. For each token in the response, you go layer by layer: read that layer's weights out of VRAM and use them plus the KV cache to compute the input to the next layer. After the last layer you output a new token and update the KV cache with it. (Rough sketch of this loop below.)
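To make step 3 concrete, here's a toy sketch of that per-token decode loop, just numpy. Every name and size is made up, one weight matrix stands in for a whole layer, and the "attention" over the cache is a crude placeholder for the real QK-softmax-V math:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, VOCAB = 64, 4, 1000        # toy sizes, purely illustrative

# Step 1: "load the weights into VRAM" -- here just numpy arrays in RAM.
layer_weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]
embed = rng.standard_normal((VOCAB, D))
unembed = rng.standard_normal((D, VOCAB)) / np.sqrt(D)

# Step 2: the KV cache -- per layer, one entry for every token seen so far.
kv_cache = [[] for _ in range(N_LAYERS)]

def decode_one_token(token_id):
    x = embed[token_id]                    # activation for the newest token
    for i, W in enumerate(layer_weights):  # step 3: one pass over all the layers
        kv_cache[i].append(x.copy())       # grow the cache with this token's entry
        ctx = np.mean(kv_cache[i], axis=0) # placeholder for attention over the cache
        x = np.tanh((x + ctx) @ W)         # read W from memory, compute next layer's input
    logits = x @ unembed
    return int(np.argmax(logits))          # greedy sampling, for simplicity

tok = 42
for _ in range(5):                         # generate 5 tokens, strictly one at a time
    tok = decode_one_token(tok)
    print(tok)
```

The point to notice is that every single generated token walks through every `W` again; only the KV cache carries state forward between tokens.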

Furthermore, my understanding is that the bottleneck of this process is usually step 3, where you read each layer's weights out of VRAM: generating a single token means streaming essentially every weight in the model across the memory bus, so memory bandwidth rather than compute sets the pace.
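Quick back-of-the-envelope for why that read dominates. These are round illustrative figures I'm assuming, not any particular GPU or model:

```python
# Hypothetical numbers, for illustration only.
params = 7e9                 # a 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights
bandwidth = 1.0e12           # ~1 TB/s of VRAM bandwidth, roughly high-end GPU class

bytes_per_token = params * bytes_per_param      # every weight streamed once per token
max_tokens_per_s = bandwidth / bytes_per_token
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound for a single stream")
# -> roughly 71 tokens/s, even though the GPU's compute units could go much faster.
```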

As a result, this process parallelizes really well if you have lots of different people running independent queries at the same time: you can keep all of their KV caches in VRAM at once and push the whole batch through each layer together, so each layer's weights only get read from VRAM once for everybody.
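A toy demo of the batching effect (numpy again, sizes invented): one batched matmul pushes all the users' activations through a layer at once, so the weights only have to come out of memory once instead of once per user, and the results are the same either way:

```python
import numpy as np

rng = np.random.default_rng(1)
D, BATCH = 4096, 32                                            # invented sizes

W = rng.standard_normal((D, D)).astype(np.float32) / np.sqrt(D)     # one layer's weights
x_batch = rng.standard_normal((BATCH, D)).astype(np.float32)        # 32 users' activations

# Unbatched: the weights get streamed through once per user -- 32 passes.
out_slow = np.stack([x @ W for x in x_batch])

# Batched: one matmul serves all 32 users' tokens for this layer,
# so the weights only need to be read once.
out_fast = x_batch @ W

assert np.allclose(out_slow, out_fast, rtol=1e-4, atol=1e-4)
```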

So once you've got the VRAM, it's much more efficient to serve lots of people's different queries at once than to be one guy running one query at a time.