Comment by thesz
21 hours ago
Change layer size and you have to retrain. Change number of layers and you have to retrain. Change tokenization and you have to retrain.
Hopefully we will find a way to make minor changes without requiring a full retrain. Training how to train, as a concept, comes to mind.
And yet the KL divergence after changing all that stuff remains remarkably similar between different models, regardless of the specific hyperparameters and block diagrams employed at pretraining time. Some choices are better, some worse, but they all succeed at the game of next-token prediction to a similar extent.
To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.
Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.
That "underappreciated underlying structure or geometry" may be just an artifact of the same tokenization being used across different models.
Tokenization breaks up collocations and creates new ones that are not always present in the original text. Most probably, the first byte pair found by a simple byte pair encoding algorithm in enwik9 will be two spaces next to each other. Is this a true collocation? BPE thinks so. Humans may disagree.
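To make the "first byte pair" point concrete, here is a minimal sketch of the first step of a naive BPE pass: count every adjacent byte pair and pick the most frequent one. The function name `most_frequent_byte_pair` and the toy `sample` are made up for illustration, and overlapping pairs are counted naively; real BPE implementations differ in such details. What the first merge on enwik9 actually is would have to be checked by running this on the file itself.

```python
from collections import Counter

def most_frequent_byte_pair(data: bytes):
    """Return the most frequent adjacent byte pair and its count.

    This is only the candidate for the *first* merge of a naive BPE;
    it does not perform the merge or any later iterations.
    """
    # zip(data, data[1:]) yields every adjacent (byte, next_byte) pair as ints.
    counts = Counter(zip(data, data[1:]))
    pair, n = counts.most_common(1)[0]
    return bytes(pair), n

# Toy input (hypothetical): text where runs of two spaces are common.
sample = b"a  b  c  d ab"
pair, n = most_frequent_byte_pair(sample)
print(pair, n)
```

On text with a lot of indentation or padding, a pair of spaces winning this count is plausible, which is exactly the kind of "collocation" a human would not pick.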
What does concern me here is that it is very hard to ablate tokenization artifacts.
None of that is true, at least in theory. You can trivially change layer size simply by adding extra columns initialized as 0, effectively embedding your smaller network in a larger network. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed - you can sometimes actually improve performance simply by duplicating some middle layers[0]. Tokenization is probably the hardest but all the layers between the first and last just encode embeddings; it's probably not impossible to retrain those while preserving the middle parts.
[0] https://news.ycombinator.com/item?id=47431671 https://news.ycombinator.com/item?id=47322887
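The "embed a smaller network in a larger one" idea can be sketched on a toy two-layer ReLU MLP in plain Python. All names here (`mlp`, `widen`, the tiny weights) are hypothetical, and this is one variant rather than a canonical recipe: only the outgoing weights of the new hidden units are set to 0, while their incoming weights get small random values. That keeps the output unchanged, and it avoids the situation where zeros on both sides leave the new units with zero gradient.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x, b):
    # Dense layer: one output per row of W.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(params, x):
    W1, b1, W2, b2 = params
    return matvec(W2, relu(matvec(W1, x, b1)), b2)

def widen(params, extra, rng):
    """Add `extra` hidden units without changing the function computed."""
    W1, b1, W2, b2 = params
    n_in = len(W1[0])
    # New hidden units: small random incoming weights, zero bias.
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(extra)]
    new_W1 = [row[:] for row in W1] + new_rows
    new_b1 = b1[:] + [0.0] * extra
    # Outgoing weights of the new units are 0, so they contribute nothing yet.
    new_W2 = [row[:] + [0.0] * extra for row in W2]
    return new_W1, new_b1, new_W2, b2[:]

rng = random.Random(0)
small = (
    [[0.5, -1.0], [1.5, 2.0]],  # W1: 2 hidden units, 2 inputs
    [0.1, -0.2],                # b1
    [[1.0, -1.0]],              # W2: 1 output
    [0.3],                      # b2
)
wide = widen(small, extra=3, rng=rng)

x = [0.7, -0.4]
y_small = mlp(small, x)
y_wide = mlp(wide, x)
print(y_small, y_wide)
```

The widened network starts out computing the same outputs as the small one, so further training continues from the old solution instead of from scratch.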
You took a simple path, embedding smaller into larger. What if you need to reduce number of layers and/or width of hidden layers? How will you embed larger into smaller? As for the "addition of same layers" - would the process of "layers to add" selection be considered training?
What if you still have to obtain the best result possible for given coefficient/tokenization budget?
I think my comment expresses the general case, while yours provides some exceptions.
The general case is that our own current relative ignorance on the best way to use and adapt pretrained weights is a short-lived anomaly caused by an abundance of funding to train models from scratch, a rapid evolution of training strategies and architectures, and a mad rush to ship hot new LLMs as fast as possible. But even as it is, the things you mentioned are not impossible, they are easy, and we are only going to get better at them.
> What if you need to reduce number of layers
Delete some.
> and/or width of hidden layers?
Randomly drop x% of parameters. No doubt there are better methods that entail distillation but this works.
> would the process of "layers to add" selection be considered training?
Er, no?
> What if you still have to obtain the best result possible for given coefficient/tokenization budget?
We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.
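As a rough sketch of the two shrinking moves above ("Delete some" and randomly dropping part of a hidden layer), again in plain Python with hypothetical names (`delete_layers`, `drop_hidden_units`). It assumes the layers are shape-compatible blocks, as residual blocks are, so that removing one still leaves a valid stack, and it reads "drop x% of parameters" as dropping whole hidden units, since that is what actually reduces the width. The result is a smaller starting point for further training, not a finished model.

```python
import random

def delete_layers(layers, indices):
    """Return the layer stack without the layers at the given positions."""
    drop = set(indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]

def drop_hidden_units(W1, b1, W2, fraction, rng):
    """Remove a random `fraction` of hidden units from a two-layer MLP.

    A hidden unit is one row of W1 (plus its bias) and the matching
    column of W2, so both sides are sliced consistently.
    """
    n_hidden = len(W1)
    n_drop = int(fraction * n_hidden)
    dropped = set(rng.sample(range(n_hidden), n_drop))
    keep = [i for i in range(n_hidden) if i not in dropped]
    new_W1 = [W1[i][:] for i in keep]
    new_b1 = [b1[i] for i in keep]
    new_W2 = [[row[i] for i in keep] for row in W2]
    return new_W1, new_b1, new_W2

rng = random.Random(0)

# Stand-ins for real blocks; only the list manipulation matters here.
blocks = ["block0", "block1", "block2", "block3"]
shallower = delete_layers(blocks, [1, 2])

W1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 hidden units
b1 = [0.0, 0.1, 0.2, 0.3]
W2 = [[1.0, 1.0, 1.0, 1.0]]                            # 1 output
nW1, nb1, nW2 = drop_hidden_units(W1, b1, W2, fraction=0.5, rng=rng)
print(shallower, len(nW1), len(nW2[0]))
```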
There is evidence it is useful in some cases, but obviously no evidence it is enough if you are trying to beat SOTA.