Comment by aspenmartin

10 hours ago

Combinatorial explosion? What do you mean? Again, your experiences are true, but the models are improving with each release. The error rate on tasks continues to go down, even on novel tasks (as far as we can measure them). Again, this is where verifiable domains come in -- whatever problems you can specify, the model will improve on them, and this improvement will result in better generalization and improvements on unseen tasks. This is what I mean by taking your observations of today, ignoring the rate of progress that got us here and the known scaling laws, and then just asserting there will be some fundamental limitation. My point is that while this idea may be common, it is not at all supported by the literature or the mathematics.

The space of programs is incomprehensibly massive. Searching for a program that does what you need is a particularly difficult search problem. In the general case you can't solve search: there's no free lunch. Even scaling laws must bow to the no free lunch (NFL) theorem. But depending on the type of search problem, some heuristics can do well. We know human brains have a heuristic that can program (maybe not particularly well, but passably). We can only evaluate these agents experimentally; there is no sense in which they are mathematically destined to eventually program well.
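
To put a rough number on "incomprehensibly massive", here is a minimal sketch that counts candidate program texts as raw character strings. The function name `count_programs` and the 95-character printable-ASCII alphabet are assumptions made for illustration. Most of these strings are not valid programs, so this is only an upper bound on the raw search space, not a claim about how any agent actually searches.

```python
def count_programs(alphabet_size: int, max_length: int) -> int:
    """Count distinct strings of length 1..max_length over the alphabet.

    This is an upper bound on the number of program texts of that size;
    most of the strings counted here would not even parse.
    """
    if alphabet_size < 1 or max_length < 1:
        raise ValueError("alphabet_size and max_length must be positive")
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


# Hypothetical setting: 95 printable ASCII characters, programs up to
# 100 characters long (shorter than most real functions).
candidates = count_programs(95, 100)
print(f"about 10^{len(str(candidates)) - 1} candidate strings")
```

Brute-force enumeration is hopeless even at this tiny size, which is why the argument turns on which heuristics prune the space well rather than on raw search.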

How good are these types of algorithms at generalization? Are they learning how to code, or are they learning how to code migrations, then learning how to code caches, then learning how to code a command line arg parser, etc.?

Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."

  • So what do you think the difference is between humans and an agent in this respect? What makes you think this has any relevance to the problem? Everything is combinatorially explosive: the combinations of words that we can string into sentences and essays are also combinatorially explosive, and yet LLMs and humans have no problem with them. It's just the wrong frame of thinking for what's going on. These systems are obtaining higher and higher levels of abstraction because that is the most efficient way for them to gain performance. That's what reasoning looks like: compositions of higher-level abstractions. What you say may be true, but I don't see how it is relevant.

    "There is no sense in which they are mathematically destined to eventually program well"

    - Yes there is, and this betrays an ignorance of the literature and how things work

    - Again: RL has been around forever. Scaling laws have held empirically up to the largest scales we've tested. There are known RL scaling laws for both training and test time. It's ludicrous to state there is "no sense" in this. On the contrary, the burden of proof is squarely on you, because this has already been studied and is indeed the primary reason we're able to secure the eye-popping funding: contrary to popular HN belief, a trillion dollars of CapEx spend is based on rational, evidence-based decision making.

    > "How good are these types of algorithms at generalization"

    There is a tremendously large literature and history of this:

    - ULMFiT, BERT ==> NLP task generalization

    - https://arxiv.org/abs/2206.07682 ==> emergent capabilities

    - https://transformer-circuits.pub/2022/in-context-learning-an... ==> demonstrated circuits for in context learning as a mechanism for generalization

    - https://arxiv.org/abs/2408.10914 + https://arxiv.org/html/2409.04556v1 ==> code training produces downstream performance improvements on other tasks

    > Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."

    You say this and ignore my entire argument: you are right about all of your observations, yet

    - Opus 4.6 compared to Sonnet 3.x is clearly more generalizable and less prone to these mistakes

    - Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop and our recursive improvement loop will die off. Verifiable domains mean that we are in AlphaGo land: we're learning by doing, not by mimicking human data or memorizing a training set.

    • Hey man, it sounds like you're getting frustrated. I'm not ignoring anything; let's have a reasonable discussion without calling each other ignorant. I don't dispute the value of these tools nor that they're improving. But the no free lunch theorem is inexorable so the question is where this improvement breaks down - before or beyond human performance on programming problems specifically.

      What difference do I think there is between humans and an agent? They use different heuristics, clearly. Different heuristics are valuable on different search problems. It's really that simple.

      To be clear, I'm not calling either superior. I use agents every day. But I have noticed that claude, a SOTA model, makes basic logic errors. Isn't that interesting? It has access to the complete compendium of human knowledge and can code all sorts of things in seconds that require my trawling through endless documentation. But sometimes it forgets that to do dirty tracking on a pure function's output, it needs to dirty-track the function's inputs.
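
      For readers unfamiliar with the dirty-tracking point, a minimal sketch follows. The class `TrackedValue` and its method names are hypothetical, made up for this illustration. The idea shown is the one stated above: because the function is pure, its cached output can only go stale when one of its inputs changes, so the dirty flag has to be set where the inputs are written.

```python
from typing import Any, Callable, Dict


class TrackedValue:
    """Caches a pure function's output; recomputes only after an input changes."""

    def __init__(self, fn: Callable[..., Any], **inputs: Any) -> None:
        self._fn = fn
        self._inputs: Dict[str, Any] = dict(inputs)
        self._dirty = True          # nothing cached yet
        self._cached: Any = None
        self.recompute_count = 0    # exposed only to make the behaviour visible

    def set_input(self, name: str, value: Any) -> None:
        # The dirty flag lives on the inputs: an unchanged input leaves
        # the cached output valid, a changed one invalidates it.
        if name not in self._inputs or self._inputs[name] != value:
            self._inputs[name] = value
            self._dirty = True

    def get(self) -> Any:
        if self._dirty:
            self._cached = self._fn(**self._inputs)
            self._dirty = False
            self.recompute_count += 1
        return self._cached


area = TrackedValue(lambda width, height: width * height, width=2, height=3)
area.get()                  # computes once
area.get()                  # served from cache
area.set_input("width", 5)  # input changed, so the output is now dirty
area.get()                  # recomputes
```

      Tracking only the output would miss the `set_input` call entirely, which is the kind of basic slip described above.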

      It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research showing that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?
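
      For context on why Nim is a striking case: normal-play Nim (the player who takes the last object wins) has a well-known closed-form solution based on the XOR of the heap sizes, often called the nim-sum. The sketch below is that textbook strategy, not the method or findings of the linked research; the function names are chosen here for illustration.

```python
from functools import reduce
from operator import xor
from typing import List, Optional, Tuple


def nim_sum(heaps: List[int]) -> int:
    """XOR of all heap sizes; zero means the player to move is losing."""
    return reduce(xor, heaps, 0)


def winning_move(heaps: List[int]) -> Optional[Tuple[int, int]]:
    """Return (heap_index, new_size) that leaves a nim-sum of zero,
    or None if the position is already lost against perfect play."""
    total = nim_sum(heaps)
    if total == 0:
        return None
    for index, size in enumerate(heaps):
        target = size ^ total
        if target < size:  # a legal move must shrink the heap
            return index, target
    return None  # unreachable when total != 0, kept for type completeness


print(winning_move([3, 4, 5]))
print(winning_move([1, 2, 3]))
```

      The whole game reduces to a few lines of bit arithmetic, which is what makes a learning algorithm's failure on it interesting.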

    • > So what do you think the difference is between humans and an agent in this respect?

      Humans learn.

      Agents regurgitate training data (and quality training data is increasingly hard to come by).

      Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.

      > Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop.

      Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
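
      One way to make the "nearly flat logarithmic" point concrete, as a sketch under a stated assumption: suppose a benchmark score grows linearly in log10 of compute. That functional form and the `points_per_decade` parameter are hypothetical, chosen only to illustrate the shape of the argument; they are not taken from any published scaling law.

```python
def compute_multiplier(score_gain: float, points_per_decade: float) -> float:
    """Factor by which compute must grow to gain `score_gain` points,
    assuming score = a + points_per_decade * log10(compute).

    The log-linear form is an illustrative assumption, not a measured law.
    """
    if points_per_decade <= 0:
        raise ValueError("points_per_decade must be positive")
    return 10.0 ** (score_gain / points_per_decade)


# Under the assumed curve, every extra point costs the same *multiple*
# of compute, so the absolute cost of each next point keeps growing.
for gain in (1.0, 2.0, 3.0):
    print(gain, compute_multiplier(gain, points_per_decade=1.0))
```

      Whether that curve is "useless" depends on how steep it is in practice, which is an empirical question the sketch does not settle.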

      For a year now, all the amazing "breakthrough" models have been showing comparatively little progress, to the point that all providers have been mercilessly cheating with their graphs and benchmarks.
