Comment by kouteiheika
3 days ago
> modern LLM architectures (which aren't that different) on his website and in the github repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
This might be underselling it a little bit. The difference between GPT2 and Qwen3 is maybe, I don't know, ~20 lines of code difference if you write it well? The biggest difference is probably RoPE (which can be tricky to wrap your head around); the rest is pretty minor.
There’s Grouped Query Attention as well, a different activation function, and a bunch of not very interesting norms stuff. But yeah, you’re right - still very similar overall.