Comment by kouteiheika

7 hours ago

This isn't really anything new; I've been doing something like this for quite a while, I just haven't bothered writing a paper. (: Probably anyone who would seriously tackle the problem of "how do I train a huge model on a tiny amount of VRAM?" would come up with something similar.

However, most people in the field don't, because the practical utility of training huge models on a single GPU is quite low (e.g. they got 341 tok/s for a 14B model on a single 3090, while with my method I was getting ~1k tok/s on a single 4090; that's still very slow).

Also, there are more tricks for speeding up training and lowering VRAM usage that they're not using. For example, you don't need any gradient offloading (you can accumulate the gradients directly into the optimizer's states if you modify your optimizer), you can use Muon instead of Adam (which needs only half the VRAM Adam does for its optimizer state), you can use quantization (both for the parameters and for the optimizer states; e.g. I found Muon quantized to 4-bit working relatively well), etc.
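To make the "accumulate the gradients directly into the optimizer's states" idea concrete, here is a minimal pure-Python sketch. It assumes a plain heavy-ball momentum update (the kind of buffer Muon keeps) and toy list-valued parameters; the function names `step_with_grad_buffer` and `step_fused` are hypothetical, and this is one possible reading of the trick, not the commenter's actual code.

```python
# Hypothetical sketch: heavy-ball momentum where microbatch gradients are added
# straight into the momentum buffer, so no separate gradient-accumulation
# buffer is kept. Assumes momentum = beta * momentum + sum(microbatch grads).

def step_with_grad_buffer(param, momentum, microbatch_grads, lr=0.1, beta=0.9):
    """Baseline: sum microbatch gradients in a separate buffer, then update."""
    grad = [0.0] * len(param)  # extra buffer, same size as the parameters
    for g in microbatch_grads:
        grad = [a + b for a, b in zip(grad, g)]
    momentum = [beta * m + a for m, a in zip(momentum, grad)]
    param = [p - lr * m for p, m in zip(param, momentum)]
    return param, momentum

def step_fused(param, momentum, microbatch_grads, lr=0.1, beta=0.9):
    """Fused: decay the momentum once, then add each microbatch gradient into it."""
    momentum = [beta * m for m in momentum]  # decay once per optimizer step
    for g in microbatch_grads:
        momentum = [m + a for m, a in zip(momentum, g)]  # accumulate into the state
    param = [p - lr * m for p, m in zip(param, momentum)]
    return param, momentum

# Toy values, chosen only for illustration.
param = [1.0, -2.0, 0.5]
momentum = [0.1, 0.0, -0.3]
grads = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]

baseline = step_with_grad_buffer(param, momentum, grads)
fused = step_fused(param, momentum, grads)
```

The two versions agree up to floating-point rounding because the momentum update is linear in the gradient. Adam's second moment depends on the square of the summed gradient, so the same shortcut does not carry over to Adam without further changes.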

7 comments

stevemk14ebr 6 hours ago

As the saying goes, POC or GTFO

I invented faster than light travel, it was obvious, just didn't write a paper yet either :)

sabedevops 6 hours ago

Can you take the time to write up your methods? I’d be interested in reading them.

vlovich123 7 hours ago

341 is two orders of magnitude faster than your 1 tok/s, so it doesn’t seem like their stuff is all that obvious. I also have no baseline for training to know if 341 tok/s is slow, but it seems speedy for a 3090.

bastawhiz 7 hours ago

OP said 1k, not 1

SubiculumCode 6 hours ago

:) Coffee is good

rolandr 7 hours ago

1k tok/s = 1000 tok/s...

thrawa8387336 6 hours ago

OOM is log10