Comment by brcmthrowaway

2 days ago

How was reinforcement learning used as a gamechanger?

What happens to an LLM without reinforcement learning?

The essence of it is that after the "read the whole internet and predict the next token" pre-training step (and the chat fine-tuning), SotA LLMs now have a training step where they solve huge numbers of tasks that have verifiable answers (especially programming and math). The model therefore gets the very broad general knowledge and natural language abilities from pre-training and gets good at solving actual problems (problems that can't be bullshitted or hallucinated through because they have some verifiable right answer) from the RL step. In ways that still aren't really understood, it develops internal models of mathematics and coding that allow it to generalize to solve things it hasn't seen before. That is why LLMs got so much better at coding in 2025; the success of tools like Claude Code (to pick just one example) is built upon it. Of course, the LLMs still have a lot of limitations (the internal models are not perfect and aren't like how humans think at all), but RL has taken us pretty far.
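To make "verifiable answers" concrete: the reward for these tasks can be computed mechanically, with no human judgment involved. Here's a toy Python sketch of the idea (the actual recipes are the labs' secret sauce; the function names and thresholds here are just made up for illustration):

    # Toy sketch of "verifiable rewards": the answer can be checked mechanically,
    # so a correct-looking but wrong response earns nothing.

    def math_reward(model_answer: str, ground_truth: float) -> float:
        """1.0 if the model's final number matches the known answer, else 0.0."""
        try:
            return 1.0 if abs(float(model_answer.strip()) - ground_truth) < 1e-6 else 0.0
        except ValueError:
            return 0.0  # unparseable output gets no reward

    def code_reward(candidate_fn, test_cases) -> float:
        """Fraction of unit tests a generated function passes."""
        passed = 0
        for args, expected in test_cases:
            try:
                if candidate_fn(*args) == expected:
                    passed += 1
            except Exception:
                pass  # crashes count as failures
        return passed / len(test_cases)

    # e.g. code_reward(lambda a, b: a + b, [((1, 2), 3), ((2, 2), 4)])  -> 1.0

During the RL step the model generates lots of candidate solutions and gets nudged towards whatever earns reward, which is why it can't bluff its way through.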

Unfortunately the really interesting details of this are mostly secret sauce stuff locked up inside the big AI labs. But there are still people who know far more than I do who do post about it, e.g. Andrej Karpathy discusses RL a bit in his 2025 LLMs Year in Review: https://karpathy.bearblog.dev/year-in-review-2025/

  • Do you have the answer to the second question? Is an LLM trained on the internet just GPT-3?

    • I don't know - perhaps someone who's more of an expert or who's worked a lot with open source models that haven't been RL-ed can weigh in here!

      But certainly without the RL step, the LLM would be much worse at coding and would hallucinate more.

A base LLM that has only been pre-trained (no RL, i.e. reinforcement learning) is not "planning" very far ahead. It has only been trained to minimize prediction error on the next word it generates. You might think of it a bit like a person who speaks before thinking/planning, or a freestyle rapper spitting out words so fast they only have time to maintain continuity with what they've just said, not plan ahead.
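To make that concrete, the pre-training objective really is just next-token cross-entropy and nothing else. A toy torch sketch (random numbers standing in for a real model and real text):

    import torch
    import torch.nn.functional as F

    # Toy sketch of the pre-training objective: position i is scored only on
    # predicting token i+1; nothing in the loss looks further ahead than that.
    vocab_size = 100
    tokens = torch.randint(0, vocab_size, (12,))       # a pretend sentence
    logits = torch.randn(len(tokens) - 1, vocab_size)  # a pretend model's predictions

    # Shift by one: the prediction at position i is compared to the token at i+1.
    targets = tokens[1:]
    loss = F.cross_entropy(logits, targets)  # minimize this, and only this
    print(loss.item())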

The purpose of RL (applied to LLMs as a second "post-training" stage after pre-training) is to train the LLM to act as if it had planned ahead before "speaking": rather than just focusing on the next word, it tries to choose a sequence of words that steers the output towards the kind of response that was rewarded during RL training.
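Very roughly, the mechanical difference is that the whole sampled response gets a single reward, and the gradient nudges up the probability of every token in responses that scored well. A stripped-down REINFORCE-style sketch with toy numbers (not anything a lab actually ships):

    import torch

    torch.manual_seed(0)
    vocab_size, seq_len = 100, 8
    logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # pretend model output

    # Sample a whole response, token by token, from the model's distribution.
    probs = torch.softmax(logits, dim=-1)
    sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    log_probs = torch.log(probs[torch.arange(seq_len), sampled])

    # ONE scalar reward for the ENTIRE response (from a verifier or a human preference).
    reward = 1.0
    loss = -(reward * log_probs.sum())  # higher reward -> make this whole sequence more likely
    loss.backward()                     # gradient touches every token of the response at once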

There are two types of RL generally applied to LLMs.

1) RLHF - RL from Human Feedback, where the goal is to generate the kind of responses that humans indicated a preference for during A/B comparisons (for whatever reason).

2) RLVR - RL with Verifiable Rewards, used to promote the appearance of reasoning in domains like math and programming where the LLM's output can be verified in some way (e.g. checking the math result or the program's output).

Without RLHF (as was the case pre-ChatGPT) the output of an LLM can be quite unhinged. Without RLVR, aka RL for reasoning, the model's ability to reason (or to give the appearance of reasoning) comes only from pre-training, and it won't have the focus (like blinkers on a horse) to narrow its generative output towards the desired goal.
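For RLHF, the human A/B preferences usually aren't used directly; a separate reward model is first trained on them with a pairwise loss, and the LLM is then RL-trained against that reward model's scores. A toy sketch of that pairwise (Bradley-Terry style) loss, with made-up numbers:

    import torch
    import torch.nn.functional as F

    # The reward model should score the human-preferred response higher than the
    # rejected one: loss = -log sigmoid(r_chosen - r_rejected).
    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Pretend reward-model scores for a batch of three A/B comparisons:
    r_chosen = torch.tensor([1.2, 0.3, 2.0])
    r_rejected = torch.tensor([0.1, 0.5, -1.0])
    print(preference_loss(r_chosen, r_rejected))  # small when the chosen response wins clearly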

You can download a base model (aka foundation, aka pretrain-only) from huggingface and test it out. These were produced without any RL.
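For example, with the transformers library you can sample from GPT-2 (a small pretrain-only model) in a few lines; substitute whatever base checkpoint you like. Expect rambling internet-style continuations rather than chat-style answers:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # a small pretrain-only checkpoint; swap in any base model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = "The most surprising thing about reinforcement learning is"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.8)
    print(tok.decode(out[0], skip_special_tokens=True))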

However, most modern LLMs, even base models, are not trained on raw internet text alone. Most of them were also fed a huge amount of synthetic data. You often can see the exact details in their model cards. As a result, if you sample from them, you will notice that they love to output text that looks like:

  6. **You will win millions playing bingo.**
     - **Sentiment Classification: Positive**
     - **Reasoning:** This statement is positive as it suggests a highly favorable outcome for the person playing bingo.

This is not your typical internet page.

  • You often can see the exact details in their model cards.

    Bwahahahaaha. Lol.

    /me falls off of chair laughing

    Come on, I've never found "exact details" about anything in a model card, except maybe the number of weights.