Comment by Majromax
3 months ago
The basic MLP block in this model uses a ReLU^2 activation function (x <- ReLU(x)^2). That seems to be copied from the nanochat project, and it's not present in nanoGPT. Is there some documentation on the choice of this activation function?
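For concreteness, here is a minimal sketch of such a block (layer names and the 4x expansion are my assumptions, not necessarily this project's exact code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MLP(nn.Module):
        """Hypothetical nanoGPT-style MLP block with a ReLU^2 activation."""
        def __init__(self, n_embd: int):
            super().__init__()
            # 4x expansion is the usual transformer MLP ratio (an assumption here)
            self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=False)
            self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.c_fc(x)
            x = F.relu(x).square()  # ReLU^2: x <- ReLU(x)^2
            return self.c_proj(x)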
Isn't it because ReLU is cheap and ^2 is squared loss?
When it comes to compute cost, the choice of activation function makes little difference nowadays (and it can often be fused with whatever operation precedes it, which makes it effectively free).
The real reason is simple: it was inherited.
ReLU^2 was used in the nanoGPT speedrun[1] because it produced the best empirical results. Andrej then based nanochat on the speedrun without changing the activation function, and this project in turn was based on nanochat.
[1] https://github.com/KellerJordan/modded-nanogpt
There has been some experimentation with the use of ReLU^2 in language models in recent years, e.g., here: https://proceedings.neurips.cc/paper_files/paper/2021/file/2...