
Comment by crystal_revenge

2 days ago

> massive application of reinforcement learning techniques

So sad that "reinforcement learning" is another term whose meaning has been completely destroyed by uneducated hype around LLMs (very similar to "agents"). Five years ago, nobody familiar with RL would have considered what these companies are doing to be "reinforcement learning".

RLHF and similar techniques are much, much closer to traditional fine-tuning than they are to reinforcement learning. RL has almost always, historically, assumed online training and interaction with an environment. RLHF collects data from users and uses it to teach the LLM to be more engaging.
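
For contrast, here is the shape of what RL has classically meant: an agent interacting with an environment and updating from every transition as it happens. This is just a toy tabular Q-learning example of my own, nothing to do with LLMs:

```python
# Toy online RL: the agent acts, observes a reward, and updates immediately.
import random

class CorridorEnv:
    """Tiny 1-D corridor: start at position 0, reach the far end for reward +1."""
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = left, 1 = right
        self.pos = max(0, min(self.length - 1, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.length - 1
        return self.pos, (1.0 if done else 0.0), done

env = CorridorEnv()
q = {(s, a): 0.0 for s in range(env.length) for a in (0, 1)}
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(200):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection against the current value estimates
        a = random.choice((0, 1)) if random.random() < eps else max((0, 1), key=lambda act: q[(s, act)])
        s2, r, done = env.step(a)
        # The "online" part: the estimates change after every single interaction.
        q[(s, a)] += alpha * (r + gamma * max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)])
        s = s2
```

The point isn't the algorithm, it's the loop: the policy is shaped by its own ongoing interaction with the environment, not by a fixed dataset collected ahead of time.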

This fine-tuning also doesn't magically transform LLMs into something different, but it is largely responsible for their sycophantic behavior. RLHF makes LLMs more pleasing to humans (and of course can be exploited to help move the needle on benchmarks).

It's really unfortunate that people will throw away their knowledge of computing in order to maintain a belief that LLMs are something more than they are. LLMs are great, very useful, but they're not producing "nontrivial emergent phenomena". They're increasingly trained as products meant to drive engagement. I've found LLMs less useful in 2025 than in 2024. And the trend of people not opening them up under the hood and playing around with them to explore what they can do has basically made me leave the field (I used to work in AI-related research).

I wasn't referring to RLHF, which people were of course already doing heavily in 2023, but to RLVR, i.e. LLMs solving tons of coding and math problems against a reward function after pre-training. I discussed that in another reply, so I won't repeat it here; instead I'd refer you to Andrej Karpathy's 2025 LLM Year in Review, which discusses it. https://karpathy.bearblog.dev/year-in-review-2025/
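
To be concrete about what "verifiable reward" means, here's a toy sketch of the grading side (purely illustrative and mine, not any lab's actual code): the reward comes from mechanically checking the model's answer, not from a learned human-preference model.

```python
import re

def verifiable_reward(completion: str, ground_truth: float) -> float:
    """Reward 1.0 if the last number in the sampled completion matches the known answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - ground_truth) < 1e-6 else 0.0

# Example: grading sampled solutions to "What is 17 * 24?"
print(verifiable_reward("17 * 24 = 408, so the answer is 408", 408))  # 1.0
print(verifiable_reward("I think it's 400", 408))                     # 0.0
```

Because the check is mechanical, you can generate and grade rollouts at whatever scale you can afford, which is exactly what makes math and coding problems so attractive for this.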

> I've found LLMs less useful in 2025 than in 2024.

I really don't know how to reply to this part without sounding insulting, so I won't.

  • While RLVR is neat, it is still an 'offline' learning setup that just borrows a reward function in the style of RL.

    And did you not read the entire post? Karpathy basically calls out the same point that I am making regarding RL which "of course can be exploited to help move the needle on benchmarks":

    > Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form

    Regarding:

    > I really don't know how to reply to this part without sounding insulting, so I won't.

    Relevant to citing him: Karpathy has publicly praised some of my past research in LLMs, so please don't hold back your insults. A poster on HN telling me I'm "not using them right!!!" won't shake my confidence terribly. I use LLMs less this year than last year and have been much more productive. I still use them; LLMs are interesting and very useful. I just don't understand why people have to get into hysterics trying to make them more than that.

    I also agree with Karpathy's statement:

    > In any case they are extremely useful and I don't think the industry has realized anywhere near 10% of their potential even at present capability.

    But magical thinking around them is slowing down progress imho. Your original comment itself is evidence of this:

    > I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely.

    I would say "Rip them open! Start playing around with the internals! Mess around with sampling algorithms! Ignore the 'win market share' hype and benchmark gaming and see just what you can make these models do!" Even if restricted to just open, relatively small models, there's so much more interesting work in this space.
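
    For anyone who wants a starting point, this is the kind of tinkering I mean: skip the generate() convenience wrapper and write your own sampling loop over a small open model, so every knob is yours to change. A rough sketch, using gpt2 purely as a convenient example checkpoint:

    ```python
    # Hand-rolled top-k / temperature sampling over a small open causal LM.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # any small open checkpoint works for this kind of exploration
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.eval()

    def sample(prompt, max_new_tokens=40, temperature=0.8, top_k=50):
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            with torch.no_grad():
                logits = model(ids).logits[0, -1]       # next-token distribution
            logits = logits / temperature               # sharpen or flatten it
            topv, topi = torch.topk(logits, top_k)      # keep the k most likely tokens
            probs = torch.softmax(topv, dim=-1)
            next_id = topi[torch.multinomial(probs, 1)] # sample from the truncated dist
            ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        return tok.decode(ids[0], skip_special_tokens=True)

    print(sample("The most interesting thing about language models is"))
    ```

    Once the loop is yours, swapping in nucleus sampling, contrastive decoding, logit steering, or whatever else you dream up is a few lines of code.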

    • RLVR is not offline learning. It's not learning from a static dataset. These are live rollouts that are verified and that update the weights on each pass based on feedback from the environment (a stripped-down sketch of the loop is at the end of this comment).

      You might argue that traditional RL involves multiple states the agent moves through. But autoregressive LLMs are the same: each forward pass that generates a token also creates a change in state.

      After training, the weights are fixed, of course, but that is the case for most traditional RL systems. RL does not intrinsically mean continually updating weights in deployment, which carries a bunch of its own problems.

      From the premise that RLVR can be used to benchmaxx (true!), it does not follow that it is only good for that.
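
      Here's the stripped-down loop I mean, as a REINFORCE-style sketch (my own toy code: gpt2 and the string-match verifier are stand-ins, and real RLVR setups use PPO/GRPO-style updates with baselines and far more careful verification):

      ```python
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      name = "gpt2"  # stand-in for whatever policy model is being trained
      tok = AutoTokenizer.from_pretrained(name)
      model = AutoModelForCausalLM.from_pretrained(name)
      opt = torch.optim.Adam(model.parameters(), lr=1e-5)

      def verify(text, answer="408"):
          # stand-in verifier: reward 1 if the known answer shows up in the rollout
          return 1.0 if answer in text else 0.0

      prompt_ids = tok("Q: What is 17 * 24? A:", return_tensors="pt").input_ids
      n_prompt = prompt_ids.shape[1]

      for step in range(3):
          # Live rollout: sampled from the *current* weights, not a static dataset.
          rollout = model.generate(prompt_ids, do_sample=True, max_new_tokens=12,
                                   pad_token_id=tok.eos_token_id)
          reward = verify(tok.decode(rollout[0, n_prompt:]))

          # REINFORCE-style update: scale the log-probs of the sampled tokens by reward.
          logp = torch.log_softmax(model(rollout).logits[:, :-1], dim=-1)
          token_logp = logp.gather(-1, rollout[:, 1:].unsqueeze(-1)).squeeze(-1)
          loss = -(reward * token_logp[:, n_prompt - 1:]).mean()

          opt.zero_grad()
          loss.backward()
          opt.step()  # the weights change before the next rollout is sampled
      ```

      The environment feedback (the verifier) shapes each subsequent rollout, which is why I'd call it online rather than offline learning.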

    • What do you think about Geoffrey Hinton's concerns about AI (setting "AGI" aside)? Do you agree with those concerns, or do you believe that LLMs are only useful to a degree that poses no risk to our society?