Comment by dTal

3 hours ago

Our current relative ignorance about how best to use and adapt pretrained weights is a short-lived anomaly, caused by abundant funding for training models from scratch, rapidly evolving training strategies and architectures, and a mad rush to ship hot new LLMs as fast as possible. But even today, the things you mention are not impossible; they are easy, and we are only going to get better at them.

> What if you need to reduce number of layers

Delete some.
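
To make "delete some" concrete, here is a rough sketch. It assumes the model is a stack of residual blocks that all share the same input and output shape (as in a typical transformer), so removing a block leaves the rest wired up correctly. The function name `drop_layers` and the "keep every k-th block" rule are my own illustration, not an established recipe; which blocks are best to drop is an open question, and you would normally fine-tune afterwards.

```python
def drop_layers(layers, keep_every=2):
    """Return a shallower stack: keep every `keep_every`-th block plus the last.

    Assumes every block maps a tensor of shape S to shape S, so any
    subset of blocks still composes. `layers` can be any list of blocks.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    last = len(layers) - 1
    return [
        layer
        for i, layer in enumerate(layers)
        if i % keep_every == 0 or i == last
    ]


# Hypothetical 6-block model, represented by names for illustration.
blocks = [f"block{i}" for i in range(6)]
print(drop_layers(blocks, keep_every=2))
# ['block0', 'block2', 'block4', 'block5']
```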

> and/or width of hidden layers?

Randomly drop x% of the parameters, in practice whole hidden units, so that the layer actually gets narrower. No doubt there are better methods that entail distillation, but this works.
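
For width, one way to read "randomly drop x%" is to drop whole hidden units, since removing a unit means removing its row in the incoming weights, its bias, and its column in the outgoing weights together. A minimal sketch for a single two-layer MLP follows, using plain nested lists. The layout (`w1` as hidden x in, `w2` as out x hidden) and the name `prune_hidden_units` are assumptions for illustration; a real model would also need fine-tuning or distillation afterwards to recover quality.

```python
import random


def prune_hidden_units(w1, b1, w2, drop_fraction, seed=0):
    """Randomly remove a fraction of hidden units from a 2-layer MLP.

    w1: hidden x in weights, b1: hidden biases, w2: out x hidden weights.
    Returns the narrowed (w1, b1, w2) and the indices of the kept units.
    """
    hidden = len(w1)
    n_keep = max(1, round(hidden * (1 - drop_fraction)))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(hidden), n_keep))

    new_w1 = [w1[i] for i in keep]                     # drop rows
    new_b1 = [b1[i] for i in keep]                     # drop matching biases
    new_w2 = [[row[i] for i in keep] for row in w2]    # drop columns
    return new_w1, new_b1, new_w2, keep


# Toy example: 3 inputs, 4 hidden units, 2 outputs; drop half the hidden units.
w1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
b1 = [0.0, 0.1, 0.2, 0.3]
w2 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
new_w1, new_b1, new_w2, keep = prune_hidden_units(w1, b1, w2, drop_fraction=0.5)
print(len(new_w1), len(new_w2[0]))  # hidden width is now 2 on both sides
```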

> would the process of "layers to add" selection be considered training?

Er, no?

> What if you still have to obtain the best result possible for given coefficient/tokenization budget?

We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.
