Comment by libraryofbabel
2 days ago
It helps if you have some basic linear algebra, for sure - matrices, vectors, etc. That's probably the most important thing. You don't need to know PyTorch, which is introduced in the book as needed and in an appendix. If you want to really understand the chapters on pre-training and fine-tuning, you'll need a bit of machine learning background - a basic grasp of loss functions, gradient descent, and backpropagation. It's sort of explained in the book, but I don't think I'd have understood it much without having trained basic neural networks before. That background matters less for the earlier chapters on the architecture, e.g. how the attention mechanism works with Q, K, V as discussed in this article.
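To give a flavor of the Q/K/V part: the core mechanism really is just a few lines of matrix math. A minimal sketch (my own simplification, not the book's code):

```python
import torch

def attention(Q, K, V, causal=True):
    # Q, K, V: (batch, num_heads, seq_len, head_dim)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # similarity of each query to each key
    if causal:
        n = scores.shape[-1]
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # no peeking at future tokens
    weights = torch.softmax(scores, dim=-1)  # each row is a probability distribution
    return weights @ V                       # weighted mix of the value vectors
```

Everything else in the attention block is linear projections: producing Q, K, V from the input and mapping the result back out.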
The best part about it is seeing the code built up for the GPT-2 architecture in basic PyTorch, then loading in the real GPT-2 weights and seeing that they actually work! So it's great for learning but also quite realistic. It's LLM architecture from a few years ago (to keep it approachable), but Sebastian has some great more advanced material on modern LLM architectures (which aren't that different) on his website and in the GitHub repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
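On the weight-loading part: the book walks you through downloading the original OpenAI checkpoint. If you just want a quick way to sanity-check your own implementation, one option (not the book's method) is to pull the reference model from Hugging Face and compare outputs:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
reference = GPT2LMHeadModel.from_pretrained("gpt2")

# Generate with the reference weights; your from-scratch model should
# produce the same logits for the same input if the port is correct.
ids = tokenizer("Hello, I'm a language model,", return_tensors="pt")
out = reference.generate(**ids, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```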
> modern LLM architectures (which aren't that different) on his website and in the GitHub repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
This might be underselling it a little bit. The difference between GPT-2 and Qwen3 is maybe, I don't know, ~20 lines of code if you write it well? The biggest difference is probably RoPE (which can be tricky to wrap your head around); the rest is pretty minor.
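In case it helps with the tricky part: RoPE just rotates each 2-D pair of dimensions in the query/key vectors by a position-dependent angle. A rough sketch of the interleaved-pair variant (implementations differ in how they pair up dimensions, e.g. GPT-NeoX-style splits the vector in halves instead, so this may not match Qwen3's exact layout):

```python
import torch

def rope(x, base=10000.0):
    # x: (batch, num_heads, seq_len, head_dim), head_dim must be even
    seq_len, d = x.shape[-2], x.shape[-1]
    # one rotation frequency per 2-D pair of dimensions
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, inv_freq)  # (seq_len, d // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # interleaved even/odd pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation angle depends only on the token's position, the dot product between a rotated query and key ends up depending on their relative distance, which is the whole point.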
There's Grouped Query Attention as well, a different activation function (SwiGLU in place of GELU), and a bunch of less exciting normalization changes (RMSNorm instead of LayerNorm). But yeah, you're right - still very similar overall.
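And GQA is mostly about sharing K/V heads across groups of query heads to shrink the KV cache. A rough sketch (causal mask omitted for brevity; shapes are my own illustration):

```python
import torch

def grouped_query_attention(q, k, v, group_size):
    # q:    (batch, num_q_heads, seq_len, head_dim)
    # k, v: (batch, num_kv_heads, seq_len, head_dim),
    #       where num_q_heads = num_kv_heads * group_size
    k = k.repeat_interleave(group_size, dim=1)  # each KV head serves a group of Q heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```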
Thank you! Might get the book to see what I can learn from it and which gaps I need to go research further. Appreciate the detailed response.
Sure! I don't think the linear algebra pre-req is that hard if you do need to learn it; there's tons of material online to practice with, and it's really just basic "apply this matrix to this vector" stuff. Most of what would be in even an undergrad intro to linear algebra course (inverting a matrix, determinants, whatever) is totally unnecessary.
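To make that concrete (a toy illustration, nothing from the book):

```python
import torch

W = torch.randn(3, 4)  # a 3x4 matrix
x = torch.randn(4)     # a 4-dim vector
y = W @ x              # "apply this matrix to this vector" -> 3-dim vector

# That's essentially all a linear layer does: nn.Linear stores W
# and computes x @ W.T, plus an optional bias.
layer = torch.nn.Linear(4, 3)
```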