Comment by montebicyclelo

7 hours ago

> nanochat is also inspired by modded-nanoGPT

Nice synergy here; the lineage is: Karpathy's nanoGPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> nanochat

modded-nanoGPT [1] is a great project, well worth checking out; it's all about massively speeding up the training of a small GPT model.

Notably, it uses the author's Muon optimizer [2], rather than AdamW, for the linear layers (a rough sketch of the core update follows the links).

[1] https://github.com/KellerJordan/modded-nanogpt

[2] https://kellerjordan.github.io/posts/muon/
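For anyone curious what that update actually looks like: the core of Muon is a momentum step whose update matrix is approximately orthogonalized with a Newton-Schulz iteration before being applied. Here's a rough PyTorch sketch of that idea (the coefficients follow the public write-up in [2]; the function names and single-tensor structure are illustrative, not the repo's actual API):

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalize a 2D update matrix with a quintic
    # Newton-Schulz iteration (coefficients from the Muon write-up [2]).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                 # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate in the "wide" orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    # One Muon update for a single linear-layer weight matrix (sketch):
    # momentum accumulation, Nesterov-style lookahead, orthogonalize, apply.
    momentum_buf.mul_(momentum).add_(grad)
    update = grad.add(momentum_buf, alpha=momentum)
    update = newton_schulz_orthogonalize(update)
    weight.data.add_(update, alpha=-lr)
```

In the real optimizer this only covers the hidden linear layers; embeddings and the output head stay on AdamW, as noted below.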

Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training.

  • This is the common belief, but it's not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added the practical pieces to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc.).

    Both share equal credit, I feel (along with the paper's co-authors!); both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.

    (Source: am experienced speedrunner who's been in these circles for a decent amount of time)

8xH100 is pretty wild for a single inference node.

Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?

At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700 requests per hour, or about $0.01 per request.
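Spelled out, that arithmetic looks like this (assuming the ~$8/hr rental price and strictly serial 5-second requests, i.e. ignoring any batching or parallelism):

```python
# Back-of-the-envelope cost per request (serial requests assumed).
node_cost_per_hour = 8.0            # ~$8/hr, as assumed above
seconds_per_request = 5.0

requests_per_hour = 3600 / seconds_per_request             # 720
cost_per_request = node_cost_per_hour / requests_per_hour  # ~$0.011
print(f"{requests_per_hour:.0f} req/hr, ${cost_per_request:.3f}/request")
```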

Is my math wrong?

  • This is the spec for a training node. Inference requires 80GB of VRAM, so significantly less compute.

  • As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.