Comment by Zambyte
1 day ago
My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.
1 day ago

> My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.
The model weights might indeed get loaded on every token, just within the GPU: from GPU memory (VRAM) into the GPU's compute units, depending on how much of the model is held in on-chip caches. The inputs to every layer must be loaded as well. And if the model doesn't fit in GPU memory but does fit in CPU memory and you're doing GPU offloading, then you're also shuffling weights between CPU and GPU memory on every token.
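This per-token streaming is why single-stream generation is typically memory-bandwidth bound: each token requires reading roughly the full set of weights once, so throughput is capped at bandwidth divided by model size. A back-of-envelope sketch (the model size and bandwidth figures below are illustrative assumptions, not measurements of any specific hardware):

```python
# Rough upper bound on decode speed in the bandwidth-bound regime:
# every generated token streams all weights from memory once, so
# tokens/sec <= effective bandwidth / bytes of weights read per token.
def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return bandwidth_bytes_per_s / model_bytes

GiB = 1024**3
model_bytes = 7e9        # assumed ~7 GB of weights (e.g. a 7B model at 8-bit)
vram_bw = 900 * GiB      # assumed ~900 GiB/s VRAM bandwidth
pcie_bw = 16 * GiB       # assumed ~16 GiB/s practical PCIe transfer rate

# Fully GPU-resident: limited by VRAM bandwidth.
print(f"GPU-resident:  {tokens_per_second(model_bytes, vram_bw):.1f} tok/s")
# Offloaded weights re-shuffled over PCIe each token: far slower ceiling.
print(f"CPU-offloaded: {tokens_per_second(model_bytes, pcie_bw):.2f} tok/s")
```

The two orders of magnitude between the VRAM and PCIe ceilings are the reason offloading only part of the model to CPU memory slows generation so sharply.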