Comment by Zambyte
1 day ago
My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.
1 day ago

> My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.
The model weights might indeed get loaded on every token, just within the GPU: from GPU memory (VRAM) into the GPU's compute units, depending on how much of the model is held in on-chip caches. The inputs to every layer must be loaded as well. And if the model doesn't fit in GPU memory but does fit in CPU memory and you're doing GPU offloading, then you're also shuffling weights between CPU and GPU memory on every token.
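This per-token streaming is why single-stream generation is typically memory-bandwidth bound: each token requires reading roughly the full set of weights once, so throughput is capped at bandwidth divided by model size. A back-of-envelope sketch (the model size and bandwidth figures below are illustrative assumptions, not measurements of any specific hardware):

```python
# Rough upper bound on decode speed in the bandwidth-bound regime:
# every generated token streams all weights from memory once, so
# tokens/sec <= effective bandwidth / bytes of weights read per token.
def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return bandwidth_bytes_per_s / model_bytes

GiB = 1024**3
model_bytes = 7e9        # assumed ~7 GB of weights (e.g. a 7B model at 8-bit)
vram_bw = 900 * GiB      # assumed ~900 GiB/s VRAM bandwidth
pcie_bw = 16 * GiB       # assumed ~16 GiB/s practical PCIe transfer rate

# Fully GPU-resident: limited by VRAM bandwidth.
print(f"GPU-resident:  {tokens_per_second(model_bytes, vram_bw):.1f} tok/s")
# Offloaded weights re-shuffled over PCIe each token: far slower ceiling.
print(f"CPU-offloaded: {tokens_per_second(model_bytes, pcie_bw):.2f} tok/s")
```

The two orders of magnitude between the VRAM and PCIe ceilings are the reason offloading only part of the model to CPU memory slows generation so sharply.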