Comment by kouteiheika

3 days ago

> modern LLM architectures (which aren't that different) on his website and in the github repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.

This might be underselling it a little bit. The difference between GPT-2 and Qwen3 is maybe, I don't know, ~20 lines of code if you write it well? The biggest difference is probably RoPE (which can be tricky to wrap your head around); the rest is pretty minor.
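For anyone who finds RoPE hard to picture, here is a minimal sketch in plain Python rather than anything taken from the repo. It rotates each pair of dimensions in a query or key vector by an angle proportional to the token position. The function names, the "half-split" pairing of dimensions, and `base=10000.0` are my own assumptions for illustration; real models pick their own base and layout.

```python
import math

def rope(x, position, base=10000.0):
    """Apply a rotary position embedding to one head vector x (even length).

    Assumption: dimension i is paired with dimension i + d/2 ("half-split"
    layout); some implementations pair adjacent dimensions instead.
    """
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Each pair rotates at its own frequency; low i rotates fastest.
        theta = position * base ** (-2.0 * i / d)
        cos, sin = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out

def dot(u, v):
    """Plain dot product, used here as the attention score."""
    return sum(a * b for a, b in zip(u, v))
```

The point of the rotation is that the score between a rotated query and a rotated key depends only on how far apart the two tokens are, not on their absolute positions: `dot(rope(q, 5), rope(k, 3))` matches `dot(rope(q, 12), rope(k, 10))` up to floating-point error.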

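There’s Grouped Query Attention as well, a different activation function (SwiGLU instead of GELU), and a bunch of not very interesting normalization changes (RMSNorm instead of LayerNorm, plus a norm on the queries and keys). But yeah, you’re right - still very similar overall.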
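To show how small the Grouped Query Attention change is, here is a rough pure-Python sketch, not code from the article or the Qwen3 implementation. The function names and the nested-list layout (`[heads][tokens][dims]`) are assumptions chosen for readability. The only line that differs from ordinary multi-head attention is the one that maps a query head to its shared key/value head.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gqa(q, k, v, n_kv_heads):
    """Causal grouped-query attention.

    q: [n_q_heads][T][d]; k and v: [n_kv_heads][T][d] (hypothetical layout).
    """
    n_q_heads = len(q)
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly"
    group = n_q_heads // n_kv_heads
    out = []
    for h in range(n_q_heads):
        kv = h // group  # the GQA part: a group of query heads shares one K/V head
        kh, vh = k[kv], v[kv]
        d = len(q[h][0])
        head_out = []
        for t, qt in enumerate(q[h]):
            # Causal mask: token t only attends to positions 0..t.
            scores = [
                sum(a * b for a, b in zip(qt, kh[s])) / math.sqrt(d)
                for s in range(t + 1)
            ]
            w = softmax(scores)
            head_out.append(
                [sum(w[s] * vh[s][j] for s in range(t + 1)) for j in range(d)]
            )
        out.append(head_out)
    return out
```

With `n_kv_heads == n_q_heads` this reduces to regular multi-head attention, and with `n_kv_heads == 1` it becomes multi-query attention, which is why the change barely moves the line count.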