Comment by libraryofbabel
2 days ago
I read this article back when I was learning the basics of transformers; the visualizations were really helpful. In retrospect, though, knowing how a transformer works wasn't very useful at all in my day job applying LLMs, except as a sort of deep background for reassurance that I had some idea of how the big black box producing the tokens was put together, and to give me the mathematical basis for things like context size limitations, etc.
I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely. That is a trap. Big SotA models these days exhibit so many nontrivial emergent phenomena (in part due to the massive application of reinforcement learning techniques) that they have capabilities very few people expected to ever see when this architecture first arrived. Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks. We were wrong. That points towards some caution and humility about using network architecture alone to reason about how LLMs work and what they can do. You'd really need to be able to poke at the weights inside a big SotA model to even begin to answer those kinds of questions, but unfortunately that's only really possible if you're a "mechanistic interpretability" researcher at one of the major labs.
Regardless, this is a nice article, and this stuff is worth learning because it's interesting for its own sake! Right now I'm actually spending some vacation time implementing a transformer in PyTorch just to refresh my memory of it all. It's a lot of fun! If anyone else wants to get started with that I would highly recommend Sebastian Raschka's book and youtube videos as a way into the subject: https://github.com/rasbt/LLMs-from-scratch .
Has anyone read TFA author Jay Alammar's book (published Oct 2024) and would they recommend it for a more up-to-date picture?
> massive application of reinforcement learning techniques
So sad that "reinforcement learning" is another term whose meaning has been completely destroyed by uneducated hype around LLMs (very similar to "agents"). 5 years ago nobody familiar with RL would consider what these companies are doing as "reinforcement learning".
RLHF and similar techniques are much, much closer to traditional fine-tuning than they are to reinforcement learning. RL almost always, historically, assumes online training and interaction with an environment. RLHF is collecting data from users and using it to teach the LLM to be more engaging.
This fine-tuning also doesn't magically transform LLMs into something different, but it is largely responsible for their sycophantic behavior. RLHF makes LLMs more pleasing to humans (and of course can be exploited to help move the needle on benchmarks).
It's really unfortunate that people will throw away their knowledge of computing in order to maintain a belief that LLMs are something more than they are. LLMs are great, very useful, but they're not producing "nontrivial emergent phenomena". They're increasingly trained as products to increase engagement. I've found LLMs less useful in 2025 than in 2024. And the trend of people not opening them up under the hood and playing around with them to explore what they can do has basically made me leave the field (I used to work in AI-related research).
I wasn't referring to RLHF, which people were of course already doing heavily in 2023, but RLVR, aka LLMs solving tons of coding and math problems with a reward function after pre-training. I discussed that in another reply, so I won't repeat it here; instead I'd just refer you to Andrej Karpathy's 2025 LLM Year in Review which discusses it. https://karpathy.bearblog.dev/year-in-review-2025/
> I've found LLMs less useful in 2025 than in 2024.
I really don't know how to reply to this part without sounding insulting, so I won't.
While RLVR is neat, it is still an 'offline' learning model that just borrows a reward function from RL.
And did you not read the entire post? Karpathy basically calls out the same point that I am making regarding RL which "of course can be exploited to help move the needle on benchmarks":
> Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form
Regarding:
> I really don't know how to reply to this part without sounding insulting, so I won't.
Relevant to citing him: Karpathy has publicly praised some of my past research in LLMs, so please don't hold back your insults. A poster on HN telling me I'm "not using them right!!!" won't shake my confidence terribly. I use LLMs less this year than last year and have been much more productive. I still use them, LLMs are interesting, and very useful. I just don't understand why people have to get into hysterics trying to make them more than that.
I also agree with Karpathy's statement:
> In any case they are extremely useful and I don't think the industry has realized anywhere near 10% of their potential even at present capability.
But magical thinking around them is slowing down progress imho. Your original comment itself is evidence of this:
> I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely.
I would say "Rip them open! Start playing around with the internals! Mess around with sampling algorithms! Ignore the 'win market share' hype and benchmark gaming and see just what you can make these models do!" Even if restricted to just open, relatively small models, there's so much more interesting work in this space.
I agree and disagree. In my day job as an AI engineer I rarely if ever need to use any "classic" deep learning to get things done. However, I'm a firm believer that understanding the internals of an LLM can set you apart as a gen AI engineer, if you're interested in becoming the top 1% in your field. There can and will be situations where your intuition about the constraints of your model is superior compared to peers who consider the LLM a black box. I had this advice given directly to me years ago, in person, by Clem Delangue of Hugging Face - I took it seriously and really doubled down on understanding the guts of LLMs. I think it's served me well.
I’d give similar advice to any coding bootcamp grad: yes you can get far by just knowing python and React, but to reach the absolute peak of your potential and join the ranks of the very best in the world in your field, you’ll eventually want to dive deep into computer architecture and lower level languages. Knowing these deeply will help you apply your higher level code more effectively than your coding bootcamp classmates over the course of a career.
I suppose I actually agree with you, and I would give the same advice to junior engineers too. I've spent my career going further down the stack than I really needed to for my job and it has paid off: everything from assembly language to database internals to details of unix syscalls to distributed consensus algorithms to how garbage collection works inside CPython. It's only useful occasionally, but when it is useful, it's for the most difficult performance problems or nasty bugs that other engineers have had trouble solving. If you're the best technical troubleshooter at your company, people do notice. And going deeper helps with system design too: distributed systems have all kinds of subtleties.
I mostly do it because it's interesting and I don't like mysteries, and that's why I'm relearning transformers, but I hope knowing LLM internals will be useful one day too.
Wouldn't you say that people who pursue deep architectural knowledge should just go down the AI Researcher career track? I feel like that's where that sort of knowledge actually matters.
I think the biggest problem is that most tutorials use words to illustrate how the attention mechanism works. In reality, there are no word-associated tokens inside a Transformer. Tokens != word parts. An LLM does not perform language processing inside the Transformer blocks, and a Vision Transformer does not perform image processing. Words and pixels are only relevant at the input. I think this misunderstanding was a root cause of underestimating their capabilities.
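To make that concrete, here's a tiny sketch (assuming the Hugging Face transformers tokenizer and PyTorch, with GPT-2's tokenizer as a convenient example): by the time anything reaches an attention block, there are only integer ids and dense vectors.

    import torch
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    ids = tok("unbelievable tokenization", return_tensors="pt").input_ids
    print(tok.convert_ids_to_tokens(ids[0].tolist()))  # subword pieces, not words

    # Stand-in for the model's real embedding table: ids become dense vectors.
    embed = torch.nn.Embedding(tok.vocab_size, 768)
    x = embed(ids)
    print(x.shape)  # (1, seq_len, 768) -- this is all the Transformer blocks ever see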
An example of why a basic understanding is helpful:
A common sentiment on HN is that LLMs generate too many comments in code.
But comment spam is going to help code quality, due to the way causal transformers and positional encoding work. The model has learned to dump locally-specific reasoning tokens where they're needed, in a tightly scoped cluster that can be attended to easily, and forgotten about just as easily later on. It's like a disposable scratchpad to reduce the errors in the code it's about to write.
The solution to comment spam is textual/AST post-processing of generated code, rather than prompting the LLM to handicap itself by not generating as many comments.
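A minimal sketch of what that post-processing could look like for Python output (the helper name is mine; a parse/unparse round trip drops comments because they never enter the AST, though it also normalizes formatting):

    import ast

    def strip_comments(source: str) -> str:
        # Comments never enter the AST, so a parse/unparse round trip removes them.
        # Caveat: ast.unparse (Python 3.9+) also normalizes quoting and spacing.
        return ast.unparse(ast.parse(source))

    generated = "total = 0  # running sum\nfor n in [1, 2, 3]:  # accumulate\n    total += n\nprint(total)\n"
    print(strip_comments(generated))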
Unless you have evidence from a mechanistic interpretability study showing what's happening inside the model when it creates comments, this is really only a plausible-sounding just-so story.
Like I said, it's a trap to reason from architecture alone to behavior.
Yes, I should have made it clear that it is an untested hypothesis.
You're describing this as if you actually knew what's going on in these models. In reality it's just a guess, and not a very convincing one.
An example of why a basic understanding is helpful:
A common sentiment on HN is that LLMs generate too many comments in code.
For good reason -- comment sparsity improves code quality, due to the way causal transformers and positional encoding work. The model has learned that real, in-distribution code carries meaning in structure, naming, and control flow, not dense commentary. Fewer comments keep next-token prediction closer to the statistical shape of the code it was trained on.
Comments aren’t a free scratchpad. They inject natural-language tokens into the context window, compete for attention, and bias generation toward explanation rather than implementation, increasing drift over longer spans.
The solution to comment spam isn’t post-processing. It’s keeping generation in-distribution. Less commentary forces intent into the code itself, producing outputs that better match how code is written in the wild, and forcing the model into more realistic context avenues.
Literally the exact thing I tell new hires on projects for training models: theory is far less important than practice.
We are only just beginning to understand how these things work. I imagine it will end up being similar to Freud's Oedipal complex: lacking a fully physical understanding of cognition, we employed a schematic narrative. Something similar is already emerging.
> would never be able to perform well on novel coding or mathematics tasks. We were wrong
I'm not clear at all we were wrong. A lot of the mathematics announcements have been rolled back and "novel coding" is exactly where the LLMs seem to fail on a daily basis - things that are genuinely not represented in the training set.
Nice video on mechanistic interpretability from Welch Labs:
https://youtu.be/D8GOeCFFby4?si=2rWnwv4M2bjkpEoc
Maybe the biggest benefit is that it gives people enough background knowledge to read the next new paper.
It is almost like understanding wood at a molecular level and being a carpenter. It may also help the carpentry, but you can be a great one without it. And a bad one with the knowledge.
How was reinforcement learning used as a gamechanger?
What happens to an LLM without reinforcement learning?
The essence of it is that after the "read the whole internet and predict the next token" pre-training step (and the chat fine-tuning), SotA LLMs now have a training step where they solve huge numbers of tasks that have verifiable answers (especially programming and math). The model therefore gets the very broad general knowledge and natural language abilities from pre-training and gets good at solving actual problems (problems that can't be bullshitted or hallucinated through because they have some verifiable right answer) from the RL step. In ways that still aren't really understood, it develops internal models of mathematics and coding that allow it to generalize to solve things it hasn't seen before. That is why LLMs got so much better at coding in 2025; the success of tools like Claude Code (to pick just one example) is built upon it. Of course, the LLMs still have a lot of limitations (the internal models are not perfect and aren't like how humans think at all), but RL has taken us pretty far.
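Here's a toy sketch of what "verifiable rewards" means in practice (everything here is hypothetical and hugely simplified compared to a real RLVR pipeline, but it shows why the answers can't be bullshitted: the reward comes from a checker, not from a human's impression of the answer):

    import random

    def verify(answer: str, expected: str) -> bool:
        # Hypothetical verifier: exact match on the final answer. Real setups run
        # unit tests, symbolic math checks, etc.
        return answer.strip() == expected.strip()

    def rlvr_step(policy, tasks):
        results = []
        for prompt, expected in tasks:
            completion = policy(prompt)  # sample a solution from the LLM
            reward = 1.0 if verify(completion, expected) else 0.0
            results.append((prompt, completion, reward))
        # A real implementation would now run a policy-gradient update (PPO/GRPO-style)
        # that raises the probability of the high-reward completions.
        return results

    # Demo with a dummy "policy" that just guesses.
    dummy_policy = lambda prompt: random.choice(["4", "5"])
    print(rlvr_step(dummy_policy, [("What is 2 + 2?", "4")]))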
Unfortunately the really interesting details of this are mostly secret sauce stuff locked up inside the big AI labs. But there are still people who know far more than I do who do post about it, e.g. Andrej Karpathy discusses RL a bit in his 2025 LLMs Year in Review: https://karpathy.bearblog.dev/year-in-review-2025/
Do you have the answer to the second question? Is an LLM trained on the internet just GPT-3?
A base LLM that has only been pre-trained (no RL = reinforcement learning), is not "planning" very far ahead. It has only been trained to minimize prediction errors on the next word it is generating. You might consider this a bit like a person who speaks before thinking/planning, or a freestyle rapper spitting out words so fast they only have time to maintain continuity with what they've just said, not plan ahead.
The purpose of RL (applied to LLMs as a second "post-training" stage after pre-training) is to train the LLM to act as if it had planned ahead before "speaking", so that rather than just focusing on the next word it will instead try to choose a sequence of words that will steer the output towards a particular type of response that had been rewarded during RL training.
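For what it's worth, the pre-training signal really is just next-token prediction. A toy PyTorch sketch of the loss, with random tensors standing in for a real model and real data:

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 100, 8
    tokens = torch.randint(0, vocab_size, (1, seq_len))               # a pretend training sequence
    logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)  # stand-in for model outputs

    # Position t is scored only on how well it predicts token t+1. Nothing in the
    # objective rewards planning further ahead than that.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )
    loss.backward()
    print(loss.item())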
There are two types of RL generally applied to LLMs.
1) RLHF - RL from Human Feedback, where the goal is to generate responses that during A/B testing humans had indicated a preference for (for whatever reason).
2) RLVR - RL with Verifiable Rewards, used to promote the appearance of reasoning in domains like math and programming where the LLM's output can be verified in some way (e.g. math result or program output checked).
Without RLHF (as was the case pre-ChatGPT) the output of an LLM can be quite unhinged. Without RLVR, aka RL for reasoning, the ability of the model to reason (or give the appearance of reasoning) is a function of pre-training alone, and won't have the focus (like putting blinkers on a horse) to narrow generative output to achieve the desired goal.
You can download a base model (aka foundation, aka pretrain-only) from huggingface and test it out. These were produced without any RL.
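A quick way to poke at one (a sketch assuming the Hugging Face transformers library; GPT-2 is small and purely pretrain-only, so it's a convenient example):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("The meaning of life is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
    print(tok.decode(out[0], skip_special_tokens=True))
    # No chat template, no RLHF politeness: it just continues the text, for better or worse.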
However, most modern LLMs, even base models, are not trained on just raw internet text. Most of them were also fed a huge amount of synthetic data. You often can see the exact details in their model cards. As a result, if you sample from them, you will notice that they love to output text that looks like:
This is not your typical internet page.
> You often can see the exact details in their model cards.
Bwahahahaaha. Lol.
/me falls off of chair laughing
Come on, I've never found "exact details" about anything in a model card, except maybe the number of weights.
> Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks.
I feel like there are three groups of people:
1. Those who think that LLMs are stupid slop-generating machines which couldn't ever possibly be of any use to anybody, because there's some problem that is simple for humans but hard for LLMs, which makes them unintelligent by definition.
2. Those who think we have already achieved AGI and don't need human programmers any more.
3. Those who believe LLMs will destroy the world in the next 5 years.
I feel like the composition of these three groups has been pretty much constant since the release of ChatGPT, and, like with most political fights, evidence doesn't convince people either way.
Those three positions are all extreme viewpoints. There are certainly people who hold them, and they tend to be loud and confident and have an outsize presence in HN and other places online.
But a lot of us have a more nuanced take! It's perfectly possible to believe simultaneously that 1) LLMs are more than stochastic parrots 2) LLMs are useful for software development 3) LLMs have all sorts of limitations and risks (you can produce unmaintainable slop with them, and many people will, there are massive security issues, I can go on and on...) 4) We're not getting AGI or world-destroying super-intelligence anytime soon, if ever 5) We're in a bubble and it's going to pop and cause a big mess 6) This tech is still going to be transformative long term, on a similar level to the web and smartphones.
Don't let the noise from the extreme people who formed their opinions back when ChatGPT came out drown out serious discussion! A lot of us try and walk a middle course with this and have been and still are open to changing our minds.