Comment by staticassertion

5 days ago

I think "novel" is ill-defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.

It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.

A single math problem being solved just doesn't rise to that level of evidence for me. I think the burden is more on you to:

1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).

2. Provide evidence that supports your theory and that ideally cannot be just as well accounted for by another theory.

I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.

I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that reasoning can reduce to statistical token generation; it just strikes me as unintuitive, so I'm going to need to hear something that compels me.

[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...

> I think "novel" is ill-defined here

That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.

As for evidence that they can do novel things, there is plenty:

1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.

2. SVGs of pelicans riding bicycles

3. People use LLMs to write new apps/code every day

4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available

5. LLMs have solved open problems in physics and mathematics [0,1]
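(For what it's worth, the multiplication in point 1 is easy to check outside any model; this is plain Python, assuming nothing beyond the two numbers quoted above.)

```python
# Independent check of the multiplication claimed in point 1.
product = 167_383 * 426_397
print(product)  # 71371609051
```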

That is how far they have advanced to date. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).

[0] https://news.ycombinator.com/item?id=47497757

The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes; the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark designed to elicit failure modes? OK, so what? As if this is surprising to anyone and somehow negates everything else?

Anyone who says that “statistical models for next token generation” are unlikely to produce emergent intelligence is, I think, really not understanding what a statistical model for next token generation means. That is a proxy task DESIGNED to elicit intelligence: in order to excel at it beyond a certain point, you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many, many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
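To make "statistical model for next token generation" concrete, the core training objective can be sketched in a few lines. This is a toy illustration only — the vocabulary and logits below are invented, and real models apply this over vocabularies of tens of thousands of tokens at every position in the text:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

vocab = ["the", "cat", "sat", "mat"]        # toy vocabulary
logits = np.array([0.2, 1.5, 3.0, 0.1])    # model's scores for the next token
probs = softmax(logits)

target = vocab.index("sat")                # the token that actually came next
loss = -np.log(probs[target])              # cross-entropy at this position

# Training nudges the parameters to shrink this loss across all of the
# training text; "predicting the next token well" is exactly this proxy task.
print(float(loss))
```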

  • > The “good deal of evidence” is everywhere. The proof is in the pudding.

    I'm open! Please, by all means.

    > the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?

    The blog article is a review of benchmarking methodologies and the issues involved, written by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition; it's probably worth some consideration.

    > Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.

    Okay.

    > That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).

    This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

    As I said, a theoretical framework with real tests would be great. That's how science is done; I don't really think I'm asking for a lot here?

    > It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

    Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

    So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

    • > The “good deal of evidence” is everywhere. The proof is in the pudding. I'm open! Please, by all means.

      Sure, here are but a few:

      [1] you get smooth gains in reasoning with more RL train-time compute and more test-time compute (o1)

      [2] DeepSeek-R1 showed that RL on verifiable rewards produces behavior like backtracking, adaptation, reflection, etc.

      [3] SWE-Bench is a relatively decent benchmark and perf here is continually improving — these are real GitHub issues in real repos

      [4] MathArena — still good perf on uncontaminated 2025 AIME problems

      [5] the entire field of reinforcement learning, plus successes in other fields with verifiable domains (e.g. AlphaGo); Bellman updates will give you optimal policies eventually

      [6] Anthropic’s cool work effectively looking at the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/met... — if you trace internal circuits in Haiku 3.5 you see what you expect from a real reasoning system: planning ahead, using intermediate concepts, operating in a conceptual latent space (above tokens). And that’s Haiku 3.5!!! We’re on Opus 4.6 now…
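      On [5]: that repeated Bellman updates converge to an optimal policy is textbook value iteration, and easy to demonstrate end to end. The two-state MDP below (states, rewards, discount) is invented purely for illustration:

```python
import numpy as np

gamma = 0.9                                   # discount factor
actions = ("stay", "move")
# Deterministic toy dynamics: next_state[a][s] and reward[a][s].
next_state = {"stay": [0, 1], "move": [1, 0]}
reward     = {"stay": [0.0, 1.0], "move": [0.5, 0.0]}

V = np.zeros(2)                               # initial value estimates
for _ in range(200):                          # repeated Bellman optimality updates
    V = np.array([
        max(reward[a][s] + gamma * V[next_state[a][s]] for a in actions)
        for s in (0, 1)
    ])

# Greedy policy with respect to the converged values: state 0 should
# move toward state 1, which then stays and collects reward forever.
policy = [
    max(actions, key=lambda a: reward[a][s] + gamma * V[next_state[a][s]])
    for s in (0, 1)
]
print(policy, V)
```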

      People like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.

      > The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

      Appeals to authority are a fine prior, but lo and behold I also have a PhD and have worked on and led benchmark development professionally for several years at an AI lab. That’s ultimately no reason to really trust either of us. As I said, the blog post rightfully decries benchmarks but it then presents a new benchmark as though that isn’t subject to all of the same problems. It’s a good article! I think they do a good job here! I agree with all of their complaints about benchmarks! It rightfully identifies failure modes, and there are plenty of other papers pointing out similar failure modes. Reasoning is still brittle, lots of areas where LLMs/agentic systems fail in ways that are incredible given their talent in other areas. But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”. This is just not true, but it is true that they are brittle and fallible in weird ways, today.

      > This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

      "They do well, therefore intelligence" is not an argument, sure. But that’s also not what I’m saying. The Occam’s razor here is that reasoning-like computation is the best explanation for an increasing amount of the observed behavior, especially in fresh math and real software tasks where memorization is a much worse fit.

      > As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

      I would encourage you to read Kuhn’s Structure of Scientific Revolutions. "That’s how science is done" is a bit of an oversimplification of how the sausage is made here. Real science moves forward in a messy mix of partial theory + better measurements + interventions long before anyone has some sort of grand unified framework. Neuroscience is no different here. And I would say at this point with LLMs we now do have pretty decent tests: fresh verifiable-task evals, mechanistic circuit tracing, causal activation patching, and scaling results for RL/test-time compute. The claim that there is no framework + no real tests is just not true anymore. It’s not like we have some finished theory of reasoning, but that’s a bit of an unfair demand at this point and is asymmetrical as well.
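      For anyone unfamiliar with the "causal activation patching" mentioned above: cache an internal activation from one input's forward pass, splice it into another's, and see whether the output follows. A deliberately tiny sketch (my own toy network, not any lab's tooling):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))    # input -> hidden weights of a toy network
W2 = rng.normal(size=(4, 1))    # hidden -> output weights

def forward(x, patch=None):
    h = np.tanh(x @ W1)         # hidden activation
    if patch is not None:
        h = patch               # the causal intervention: overwrite it
    return h @ W2, h

x_a = rng.normal(size=3)
x_b = rng.normal(size=3)

y_a, h_a = forward(x_a)                  # run on A, cache its hidden activation
y_b, _ = forward(x_b)                    # baseline run on B
y_patched, _ = forward(x_b, patch=h_a)   # run on B with A's activation patched in

# The output follows the patched activation, showing that the hidden
# state causally carries the information determining the output.
print(np.allclose(y_patched, y_a))  # True: in this toy net the output depends only on h
```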

      > It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

      >> Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

      >> So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

      The model is: reasoning is not inherently human, it’s mathematical. It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

      What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventional analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above? Also I’m confused by your last part. You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd, but why? You start from the position that brains are intelligent, so why is this absurd within your argument? Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?
