Comment by highfrequency
14 days ago
Crazy that there are now five and a half companies that all have roughly state-of-the-art LLMs.
> We developed a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales. We found that chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens.
This sounds interesting. Anyone have a link to the paper or other documentation on MetaP?
It's quite similar to muP (Microsoft's Maximal Update Parametrization):
https://github.com/microsoft/mup
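For anyone wondering what that analogy means in practice, here's a rough sketch of the muP-style recipe for Adam: per-layer learning rates and init scales that are rescaled by width so values tuned on a small model transfer to a big one. MetaP itself isn't documented anywhere public, so this is only an assumption based on the muP comparison; the widths, layer sizes, and base LR below are made up for illustration.

    import torch
    import torch.nn as nn

    # muP-style width scaling (a sketch, not Meta's actual MetaP recipe).
    # Idea: tune hyper-parameters once at a small base width, then rescale
    # them per layer when widening the model.
    base_width = 256   # width at which base_lr was tuned (hypothetical)
    width = 4096       # width of the scaled-up model
    base_lr = 3e-4     # learning rate found by tuning the small model
    mult = width / base_width

    model = nn.Sequential(
        nn.Linear(64, width),     # input layer
        nn.ReLU(),
        nn.Linear(width, width),  # hidden layer
        nn.ReLU(),
        nn.Linear(width, 10),     # output layer
    )

    # Init scale: hidden-weight std shrinks as 1/sqrt(fan_in).
    nn.init.normal_(model[2].weight, std=(1.0 / width) ** 0.5)

    # Per-layer LRs for Adam: hidden/output layers get base_lr / mult,
    # while the input layer keeps base_lr, so the tuned value transfers.
    opt = torch.optim.Adam([
        {"params": model[0].parameters(), "lr": base_lr},
        {"params": model[2].parameters(), "lr": base_lr / mult},
        {"params": model[4].parameters(), "lr": base_lr / mult},
    ])

The payoff is that you can run your hyper-parameter sweeps on the cheap small model and reuse the results at scale, which would line up with the "transfer well across different values of batch size, model width, depth, and training tokens" claim in the quote.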