Comment by Majromax
3 months ago
The basic MLP block in this model uses a ReLU^2 activation function (x <- ReLU(x)^2). That seems to be copied from the nanochat project, and it's not present in nanoGPT. Is there some documentation on the choice of this activation function?
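For concreteness, here is a minimal sketch of such a block (layer names and the 4x expansion are my assumptions, not necessarily this project's exact code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MLP(nn.Module):
        """Hypothetical nanoGPT-style MLP block with a ReLU^2 activation."""
        def __init__(self, n_embd: int):
            super().__init__()
            # 4x expansion is the usual transformer MLP ratio (an assumption here)
            self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=False)
            self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.c_fc(x)
            x = F.relu(x).square()  # ReLU^2: x <- ReLU(x)^2
            return self.c_proj(x)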
Isn't it because ReLU is cheap and ^2 is squared loss?
When it comes to compute cost, the choice of activation function makes little difference nowadays (and it can often be fused with whatever operation precedes it, which makes it effectively free).
The real reason is simple: it was inherited.
ReLU^2 was used in the nanoGPT speedrun[1] because it produced the best empirical results. Andrej then based nanochat on the speedrun without changing the activation function, and this project in turn was based on nanochat.
[1] https://github.com/KellerJordan/modded-nanogpt
There has been some experimentation with the use of ReLU^2 in language models in recent years, e.g., here: https://proceedings.neurips.cc/paper_files/paper/2021/file/2...