Comment by thesz

1 day ago

Multiply "inference + backward pass (~2x inference cost) + activations (VRAM overhead)" by the batch size (thousands) to get the actual RAM and compute cost. An optimizer like Adam adds only two or three model-sized copies of overhead on top of that.

And last, but not least: for inference you only need to keep one hidden layer's activations in RAM, but to compute the gradient for even one sample you need to keep all of them (61 layers for DeepSeek models) in RAM.
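To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function names, the per-layer activation size, and the byte widths are all hypothetical placeholders, and the gradient copy is my assumption rather than something spelled out above. It also ignores activation checkpointing, KV caches, and sharding, so treat it as an illustration of the scaling, not a real memory planner.

```python
def inference_memory_bytes(n_params, act_bytes_per_layer, batch_size=1,
                           bytes_per_param=2):
    """Weights plus the activations of a single layer (assumed reusable)."""
    weights = n_params * bytes_per_param
    activations = act_bytes_per_layer * batch_size
    return weights + activations


def training_memory_bytes(n_params, n_layers, act_bytes_per_layer, batch_size,
                          bytes_per_param=2, optimizer_copies=2):
    """Weights, gradients, optimizer state, and activations of every layer."""
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param  # assumption: one gradient per weight
    optimizer = optimizer_copies * n_params * bytes_per_param
    # All layers' activations are kept for the backward pass, per sample.
    activations = n_layers * act_bytes_per_layer * batch_size
    return weights + grads + optimizer + activations


if __name__ == "__main__":
    # Toy numbers only; 61 layers is the figure quoted in the comment.
    print(inference_memory_bytes(n_params=1000, act_bytes_per_layer=10))
    print(training_memory_bytes(n_params=1000, n_layers=61,
                                act_bytes_per_layer=10, batch_size=4))
```

The activation term is the only one that grows with both layer count and batch size, which is why it dominates once the batch gets large.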

xyhopguy 20 hours ago

Microbatch size is a hyperparameter; it can be set to 1 and work just as effectively. With gradient accumulation it is even equivalent. Large batch sizes are used to increase parallelism, and sometimes to reduce variance in the loss signal (at the cost of increased bias).

Batch size is frequently limited by compute bottlenecks well before memory.
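A quick way to convince yourself of the gradient accumulation point: the toy check below uses a one-parameter model y = w * x with mean squared error, and the function names are made up for the example. It assumes the loss is a plain mean over samples and that nothing in the model depends on batch statistics (batch norm would break the equivalence).

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error of y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Same gradient, computed one microbatch at a time and accumulated."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each microbatch mean by its size so a short last
        # microbatch does not skew the average.
        total += grad_mse(w, mx, my) * len(mx)
    return total / len(xs)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 9.0]
    print(grad_mse(1.5, xs, ys))
    print(accumulated_grad(1.5, xs, ys, micro_batch_size=1))
```

Only one microbatch's worth of activations has to be live at a time here, which is the memory side of the trade; the cost is less parallelism per step.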

mcv 11 hours ago

And of course you do all of this for every object in your training set, which is going to be larger than the total number of uses for any individual user.

galaxyLogic 20 hours ago

Does the difference in the size of the inputs needed for inference vs. training matter?