Comment by spyder
4 months ago
It's for the hidden layers, not for every parameter. From Keller Jordan's Muon GitHub page:
"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."
And I just looked into the nanochat repo, and that's also how it's used there:
https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
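To make the split concrete, here's a minimal PyTorch sketch of that parameter grouping. The `TinyLM` model is hypothetical, and the Muon import and constructor signature are assumptions (left commented out; check Keller Jordan's repo for the actual API):

```python
import torch
from torch import nn

# Hypothetical import: Keller Jordan's Muon repo exposes a Muon optimizer,
# but verify the exact module name and signature there.
# from muon import Muon

class TinyLM(nn.Module):
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)  # embedding -> AdamW
        self.hidden = nn.Linear(dim, dim)      # hidden 2D weight -> Muon
        self.norm = nn.LayerNorm(dim)          # 1D gains/biases -> AdamW
        self.lm_head = nn.Linear(dim, vocab)   # classifier head -> AdamW

model = TinyLM()

muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    # Per the quote above: matrix-shaped weights in the body go to Muon;
    # embeddings, the head, and 1D gains/biases go to AdamW.
    if p.ndim >= 2 and not name.startswith(("embed", "lm_head")):
        muon_params.append(p)
    else:
        adamw_params.append(p)

optimizers = [
    # Muon(muon_params, lr=0.02, momentum=0.95),  # assumed signature
    torch.optim.AdamW(adamw_params, lr=3e-4),
]
```

The key point is just the two-way partition: each training step then calls `step()` on both optimizers, with each one touching only its own parameter group.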