Comment by aspenmartin

5 days ago

The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?

Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

> The “good deal of evidence” is everywhere. The proof is in the pudding.

I'm open! Please, by all means.

> the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?

The blog article is a review of benchmarking methodologies and the issues involved, written by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition; it's probably worth some consideration.

> Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.

Okay.

> That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).

This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.
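For concreteness, the "statistical model for next token generation" being debated is trained on nothing more exotic than cross-entropy over the next token. This is a minimal sketch of that objective (my own illustration, not from either commenter; the vocabulary size, logits, and target index are all made-up toy values):

```python
# Toy sketch of the next-token objective: cross-entropy of the correct
# next token under a softmax over the model's logits. All numbers here
# are invented for illustration.
import math

def next_token_loss(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)                                  # stabilize the softmax
    log_norm = math.log(sum(math.exp(x - m) for x in logits))
    log_prob = (logits[target_index] - m) - log_norm
    return -log_prob                                 # lower = better prediction

# A model scoring a 4-token vocabulary; the "correct" next token is index 2.
loss = next_token_loss([0.5, 1.0, 3.0, 0.1], 2)

# Putting more probability mass on the right token lowers the loss:
better = next_token_loss([0.5, 1.0, 6.0, 0.1], 2)
assert better < loss
```

The disagreement in the thread is not about this objective itself, which is uncontroversial, but about what internal machinery minimizing it at scale does or does not force into existence.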

As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

  • > The “good deal of evidence” is everywhere. The proof is in the pudding.

    >> I'm open! Please, by all means.

    Sure, here are but a few:

    [1] you get smooth gains in reasoning with more RL train-time compute and more test-time compute (o1)

    [2] DeepSeek-R1 showed that RL on verifiable rewards produces behavior like backtracking, adaptation, reflection, etc.

    [3] SWE-Bench is a relatively decent benchmark and perf here is continually improving — these are real GitHub issues in real repos

    [4] MathArena — still good perf on uncontaminated 2025 AIME problems

    [5] the entire field of reinforcement learning, plus successes in other fields with verifiable domains (e.g. AlphaGo); Bellman updates will give you optimal policies eventually

    [6] Anthropic's cool work looking effectively at the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/met... — if you trace internal circuits in Haiku 3.5 you see what you expect from a real reasoning system: planning ahead, using intermediate concepts, operating in a conceptual latent space (above tokens). And that's Haiku 3.5!!! We’re on Opus 4.6 now…
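The Bellman-update claim in [5] can be made concrete with value iteration. This is a minimal sketch (my own illustration, not from this thread or any cited source) on a made-up two-state MDP; the transition probabilities and rewards are invented toy values:

```python
# Value iteration: repeatedly applying the Bellman optimality update
# converges to the optimal value function on a (toy, made-up) MDP.

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s][a] = list of (prob, next_state); R[s][a] = immediate reward."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        V_new = []
        for s in range(n_states):
            # Bellman optimality update: max over actions of immediate
            # reward plus discounted expected value of the next state.
            V_new.append(max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            ))
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# Toy MDP: state 0 can "stay" (reward 0) or "go" to state 1 (reward 1);
# state 1 has a single self-loop action with reward 0.
P = [[[(1.0, 0)], [(1.0, 1)]],   # state 0: two actions
     [[(1.0, 1)]]]               # state 1: one action
R = [[0.0, 1.0], [0.0]]

V = value_iteration(P, R)
```

The convergence guarantee here (the Bellman operator is a contraction for gamma < 1) is what the "eventually" in point [5] refers to; the open question in the thread is how far that guarantee transfers to messier, non-tabular settings.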

    People like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.

    > The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

    Appeals to authority are a fine prior, but lo and behold I also have a PhD and have worked on and led benchmark development professionally for several years at an AI lab. That’s ultimately no reason to really trust either of us. As I said, the blog post rightfully decries benchmarks but it then presents a new benchmark as though that isn’t subject to all of the same problems. It’s a good article! I think they do a good job here! I agree with all of their complaints about benchmarks! It rightfully identifies failure modes, and there are plenty of other papers pointing out similar failure modes. Reasoning is still brittle, lots of areas where LLMs/agentic systems fail in ways that are incredible given their talent in other areas. But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”. This is just not true, but it is true that they are brittle and fallible in weird ways, today.

    > This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

    "They do well, therefore intelligence" is not an argument, sure. But that’s also not what I’m saying. The Occam’s razor here is that reasoning-like computation is the best explanation for an increasing amount of the observed behavior, especially in fresh math and real software tasks where memorization is a much worse fit.

    > As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

    I would encourage you to read Kuhn’s Structure of Scientific Revolutions. "That’s how science is done" is a bit of an oversimplification of how the sausage is made here. Real science moves forward in a messy mix of partial theory + better measurements + interventions long before anyone has some sort of grand unified framework. Neuroscience is no different here. And I would say at this point with LLMs we now do have pretty decent tests: fresh verifiable-task evals, mechanistic circuit tracing, causal activation patching, and scaling results for RL/test-time compute. The claim that there is no framework + no real tests is just not true anymore. It’s not like we have some finished theory of reasoning, but that’s a bit of an unfair demand at this point and is asymmetrical as well.

    > It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

    >> Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

    >> So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

    The model is: reasoning is not inherently human, it’s mathematical. It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

    What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventional analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above? Also I’m confused by your last part. You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd, but why? You start from the position that brains are intelligent, so why is this absurd within your argument? Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?

    • Thanks, this is great and I'll have quite a bit to read here.

      > People like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.

      I don't think it's goal post moving to acknowledge improvements but still reject the conclusion that AI has reached a specific milestone if those improvements don't justify the position. I doubt anyone sensible is rejecting improvements.

      > But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”.

      I don't think I've ever made any definitive claims at all, quite the contrary - I've tried to express exactly how open I am to what you're saying. As I've said, I'm a functionalist, and I already am largely supportive of reductive intelligence, so I'm exactly the type of person who would be sympathetic to what you're saying.

      > "That’s how science is done" is a bit of an oversimplification

      Of course, but I don't think it's too much to ask to have a theory and evidence. I don't need a lined-up series of papers that all start with perfect syllogisms and then map to well-controlled RCTs or whatever. Just an "I think this accounts for it, here's how I support that".

      > The claim that there is no framework + no real tests is just not true anymore.

      I didn't say it wasn't true, to be clear, I asked for it. Again, I'm sympathetic to the view at a glance so I simply need a way to reason about it.

      No need for a complete view, I'd never expect such a thing.

      > The model is: reasoning is not inherently human, it’s mathematical.

      Well, hand-waving perhaps, but I'd say it's maybe mathematical, computational, structural, functional, whatever - I think we're on the same page here regardless.

      > It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

      Sure, I grant that; in fact I believe it entirely. But that doesn't mean that every mathematical construct exhibits the function of intelligence.

      > What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventional analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above?

      Sorry, I'm not fully understanding this framing. We can do those things with LLMs, and it's hard to say what would satisfy me. In general, I'd be satisfied with a theory that (a) accounts for the data, (b) has supporting evidence, and (c) does not contradict any major prior commitments. I don't think (c) will be an issue here.

      > You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd,

      Because intelligence could have been a property of our brains being wet, or roundish, or it could have been a property of our spines, or maybe some force we hadn't discovered, or a soul, etc. We formed a theory, it accounted for observations, we performed tests, we've modeled things, etc, and so the theories we've adopted have been extremely successful and I think hold up quite well. But certainly we didn't go "the brain has electricity, the brain is intelligent, therefore electricity in the brain is what drives intelligence".

      > Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?

      Certainly nothing on my world view.