Comment by xyhopguy
20 hours ago
Microbatch size is a hyperparameter; it can be set to 1 and work just as effectively. With gradient accumulation it's even mathematically equivalent to a larger batch, as long as nothing in the model depends on batch statistics (e.g. batch norm). Large batch sizes are used to increase parallelism, and sometimes to reduce variance in the loss signal (at the cost of increased bias).
Batch size is frequently limited by compute bottlenecks well before memory.
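To make the gradient accumulation point concrete, here's a rough sketch in plain Python. The toy linear model, the data, and the helper names (`grad_mse`, `accumulated_grad`) are all made up for illustration. It only shows that accumulating microbatch gradients, each weighted by its share of the batch, gives the same gradient as one pass over the full batch.

```python
import random

def grad_mse(w, b, xs, ys):
    """Gradient of mean squared error for y ~ w*x + b over the given examples."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def accumulated_grad(w, b, xs, ys, micro):
    """Sum microbatch gradients, weighting each by its fraction of the full batch."""
    n = len(xs)
    gw = gb = 0.0
    for i in range(0, n, micro):
        mx, my = xs[i:i + micro], ys[i:i + micro]
        mgw, mgb = grad_mse(w, b, mx, my)
        # Weighting by len(mx) / n keeps uneven final microbatches correct.
        gw += mgw * len(mx) / n
        gb += mgb * len(mx) / n
    return gw, gb

# Hypothetical toy data: 8 noisy points around y = 3x + 0.5.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(8)]
ys = [3 * x + 0.5 + rng.gauss(0, 0.1) for x in xs]

full = grad_mse(0.0, 0.0, xs, ys)                      # one batch of 8
acc1 = accumulated_grad(0.0, 0.0, xs, ys, micro=1)     # 8 microbatches of 1
acc3 = accumulated_grad(0.0, 0.0, xs, ys, micro=3)     # microbatches of 3, 3, 2

print(full, acc1, acc3)
```

The three results should agree up to floating-point rounding. This only holds because the per-example losses are independent here; layers that use batch statistics would break the equivalence.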