Comment by kouteiheika
6 hours ago
> Activation would still require gigabytes for a few kb context.
For that you use activation checkpointing, and you can also offload those activations to the CPU in a smart way to hide the transfer latency. Although, yes, for long-context training the activations do dominate memory usage (and quantizing them degrades quality more than just quantizing the weights and/or optimizer states).
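For concreteness, here's a minimal PyTorch sketch of both techniques, using the stock `torch.utils.checkpoint.checkpoint` and `torch.autograd.graph.save_on_cpu` APIs; the toy model, depth, and shapes are made-up placeholders, not a tuned setup:

```python
import torch
import torch.nn as nn
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
dim, depth = 256, 8  # hypothetical sizes
blocks = nn.ModuleList(Block(dim) for _ in range(depth)).to(device)
x = torch.randn(2, 4096, dim, device=device, requires_grad=True)

# CPU offload: every tensor stashed for backward is moved to host RAM;
# pinned memory lets the copies run asynchronously and overlap compute.
with save_on_cpu(pin_memory=(device == "cuda")):
    h = x
    for block in blocks:
        # Checkpointing: discard this block's internal activations and
        # recompute them during backward, keeping only the block input.
        h = checkpoint(block, h, use_reentrant=False)
    loss = h.pow(2).mean()
loss.backward()
```

Real offload schedulers prefetch the saved tensors back to the GPU a layer or two ahead of the backward pass so the PCIe transfers stay off the critical path; the context manager above is the unoptimized building block.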