Comment by mccoyb

12 hours ago

It's fascinating to think about the space of problems which are amenable to RL scaling of these probability distributions.

Before, we didn't have a fast way to try problems (we had to rely on human cognition), even if the techniques and workflows were known to someone. Now, we've baked these patterns into probability distributions, and anyone can access them with the correct "summoning spell". Experts will naturally use these systems more productively, because they know how to coerce models into the conditional distributions that light up the right techniques.

One question this raises for me is how these models are going to keep up with the expanding boundary of science. If RL is required to get expert behavior into the models, what happens when experts start pushing the boundary faster? In 2030, how is Anthropic going to keep Claude "up-to-date" without either (a) continual learning with a fixed model (expanding context windows? seems hard) or (b) continual training (expensive)?

Crazy times.

A bit related: open-weights models are basically time capsules. They have a knowledge cutoff and essentially live at that point in time forever.

  • This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale. However, if you viewed them on some really large macro timescale, where LLMs are now injecting information into the universe and then re-ingesting it, then in some very philosophical way they are a /very/ slowly oscillating intelligence right now. And as we narrow that gap (maybe with a totally new non-LLM paradigm), perhaps that is ultimately what gen AI becomes. Or some new insight will let the models update themselves in some fundamental way without the insanely expensive training costs they have now.

    •   > This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale.
      

      All major LLMs today have a nontrivial context window. Whether or not this constitutes "a meaningful timescale" is application-dependent; for me it has been more than adequate.

      I also disagree that this has any bearing on whether or not "the machine is intelligent" or whether or not "submarines can swim".

    • There's nothing to say that you can't build something intelligent out of them by bolting a memory on it, though.

      Sure, it's not how we work, but I can imagine a system where the LLM does a lot of the heavy lifting while smaller, more-expensive-to-run networks that train during inference, plus RAG systems, learn how to do new things, keep persistent state, and plan.


  • Not an expert, but surely it's only a matter of time until there's a way to update a model with the latest information without having to retrain on the entire corpus?

    • On a technical level, sure, you could say it's a matter of time, but that could mean tomorrow, or in 20 years.

      And even after that, it still doesn't really solve the intrinsic problem of encoding truth. An LLM just models its training data, so new findings will be buried by virtue of being underrepresented. If you somehow brute-force the data or training, maybe you can get it to sound like it's incorporating new facts, but in actuality it will be broken and inconsistent.

    • It’s an extremely difficult problem, and if you know how to do that you could be a billionaire.

      It’s not impossible, obviously—humans do it—but it’s not yet certain that it’s possible with an LLM-sized architecture.


  • I enjoyed chatting with Opus 3 recently about recent world events, as well as newer agentic development patterns, etc.

Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.

I could totally imagine "free" inference for researchers under the condition that the reasoning traces get to be used as future training data.

  • Agreed, there's no doubt this will happen. It's likely already happening (it feels safe to assume Anthropic is curating training data from what it records in Claude Code).

    As far as I understand RL scaling (we've already maxed out RLVR), these machines only get better as long as expert reasoning traces are available.

    Having an expert work with an LLM and successfully solve a problem is high signal data, it may be the only path forward?

    My prior is that these companies will take this data without asking you as much as they can.

    • Exactly, or functionally equivalently, asking you in paragraph 37 of a 120-page PDF (bonus points: in an agreement update).

      And importantly, this can be cross-lab/model too. I suspect there's a reason why e.g. Google has been offering me free Claude inference in Google Antigravity on a free plan...

  • The site arena.ai does exactly this already, as far as I can tell. (In addition to the whole ranking thing.)

  • > Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.

    Wouldn't this lead to model collapse?

My understanding, from listening/reading what top researchers are saying, is that model architectures in the near future are going to attempt to scale the context window dramatically. There's a generalized belief that in-context learning is quite powerful and that scaling the window might yield massive benefits for continual learning.

It doesn't seem that hard, because recent open-weight models have shown that the memory cost of the context window can be dramatically reduced via hybrid attention architectures. Qwen3-Next, Qwen3.5, and Nemotron 3 Nano are all great examples; Nemotron 3 Nano can run with a million-token context window on consumer hardware.
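As a rough illustration of why hybrid attention helps so much here: full-attention KV-cache memory grows linearly with context length and with the number of full-attention layers, so replacing most layers with constant-state alternatives shrinks the cache proportionally. All numbers below are hypothetical, not taken from any of the models mentioned:

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    """KV cache for full attention: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_param * context_len

# Hypothetical dense model: 32 full-attention layers, 8 KV heads of dim 128, fp16 cache.
full = kv_cache_bytes(1_000_000, n_layers=32, n_kv_heads=8, head_dim=128)

# Hybrid sketch: only 1 in 4 layers keeps full attention; the linear-attention /
# state-space layers carry a small constant-size state, ignored here.
hybrid = kv_cache_bytes(1_000_000, n_layers=8, n_kv_heads=8, head_dim=128)

print(f"full attention, 1M tokens : {full / 2**30:.0f} GiB")
print(f"hybrid (1 in 4 layers)    : {hybrid / 2**30:.0f} GiB")
```

With these made-up numbers the cache drops from roughly 122 GiB to roughly 31 GiB at a million tokens, which is the difference between "datacenter only" and "fits on consumer hardware".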

  • I don't disagree with this, but memory cost isn't the only issue, right? I remember using Sonnet 4.5 (or 4; I can't remember which was the first of Anthropic's offerings with a million-token context): how slow the model would get, and how much it wanted to end the session early as tokens accrued (that latter point, of course, is just an artifact of bad training).

    Less worried about memory, more worried about compute speed? Are they obviously related and is it straightforward to see?

    • The compute speed is definitely correlated with the memory consumption in LLM land. More efficient attention means both less memory and faster inference. Which makes sense to me because my understanding is that memory bandwidth is so often the primary bottleneck.

      We're also seeing a recent rise in architectures boosting speed via multi-token prediction (MTP), where a single inference pass produces multiple tokens and multiplies token generation speed. Combine that with leaner ratios of active to inactive params in MoE and things end up being quite fast.

      The rapid pace of architectural improvements in recent months seems to imply that there are lots of ways LLMs will continue to scale beyond just collecting and training on new data.

    • The parent commenter is a bit confused: most of the innovation in these hybrid architectures comes from reducing compute pressure, not just memory pressure.
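The memory-bandwidth point in the thread above can be made concrete with a back-of-envelope sketch: during decoding, each step streams all active weights through memory, so throughput is roughly bandwidth divided by active-weight bytes, and MTP multiplies it by tokens per pass. The model size and bandwidth figures here are made up for illustration:

```python
def decode_tok_per_sec(active_params_b, mem_bw_gb_s, bytes_per_param=2, tokens_per_pass=1):
    """Back-of-envelope decode throughput, assuming memory-bandwidth-bound
    generation: passes/sec = bandwidth / active-weight bytes."""
    weight_bytes = active_params_b * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 / weight_bytes * tokens_per_pass

# Hypothetical MoE model with 3B active params (fp16) on a 1 TB/s accelerator.
base = decode_tok_per_sec(3, 1000)
mtp = decode_tok_per_sec(3, 1000, tokens_per_pass=2)  # 2-token MTP, verification cost ignored
print(f"{base:.0f} tok/s -> {mtp:.0f} tok/s with MTP")
```

This also shows why a small active-to-total parameter ratio in MoE matters: only the active parameters enter the denominator.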

> In 2030, how is Anthropic going to keep Claude "up-to-date"

I think the majority of research, design, and learning goes through LLMs and coding agents today; given the large user base and usage, it must be trillions of tokens per day. You can take a long research session, or a series of them, and apply hindsight: which idea above was validated below? This creates a dense learning signal grounded in real-world validation, with a human in the loop plus other tools: code and search.

> In 2030, how is Anthropic going to keep Claude "up-to-date"

In 2030 Anthropic hopes Claude will keep Anthropic "up-to-date" on its progress on itself.

I'm only half joking here.

> Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.

Part of it comes down to “knowing” what questions to ask.

  • I see it like the relationship between a student and research advisor. The advisor will ideally know the terrain and suggest a fruitful line of attack (what to ask), and the student will follow through, learning along the way.

That’s AGI, right? For the model to learn novel things itself and retain it?

I have no idea but I’m along for the ride!

> how these models are going to keep up with the expanding boundary of science

The same way humans do?

The phraseology in this comment ("probability distributions", "baked these patterns") has, IMO, all the trappings of the stochastic-parrot-style HN discourse that has been consistently wrong for almost a decade now.

The reference to how AI will keep up with AI-assisted human progress in science in 2030 is meant to reassure, but it rests on a number of premises that we have no business being confident in. We are potentially witnessing the obviation of human cognitive labor.

  • Sorry, are you familiar with what a next token distribution is, mathematically speaking?

    If you are not, let me introduce you to the term: a probability distribution.

    Just because it has profound properties ... doesn't make it different.

    > has all the trappings of the stochastic parrot-style HN-discourse that has been consistently wrong for almost a decade now

    Perhaps respond to my actual comment, rather than to whatever meta-level grouping you wish to slot it into?

    > It contains a number of premises that we have no business being confident in. We are potentially witnessing the obviation of human cognitive labor.

    What premises? Be clear.
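For concreteness, the "next token distribution" referred to above is literally a softmax over the model's vocabulary, which is a probability distribution by construction. A minimal sketch with made-up logits over a toy four-token vocabulary:

```python
import math

def next_token_distribution(logits):
    """Softmax: maps raw logits to a probability distribution
    (non-negative entries that sum to 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits; a real model produces one per vocabulary entry.
probs = next_token_distribution([2.0, 1.0, 0.5, -1.0])
print(probs, sum(probs))
```

Sampling from this distribution (greedy, temperature, top-p, etc.) is what produces the next token; nothing in that observation settles the intelligence question either way.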