Comment by sobellian

8 hours ago

The space of programs is incomprehensibly massive. Searching for a program that does what you need is a particularly difficult search problem. In the general case you can't solve search, there's no free lunch. Even scaling laws must bow to NFL. But depending on the type of search problem some heuristics can do well. We know human brains have a heuristic that can program (maybe not particularly well, but passably). To evaluate these agents we can only look at it experimentally, there is no sense in which they are mathematically destined to eventually program well.

How good are these types of algorithms at generalization? Are they learning how to code; or are they learning how to code migrations, then learning how to code caches, then learning how to code a command line arg parser, etc?

Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."

6 comments

sobellian

aspenmartin 8 hours ago

So what do you think the difference is between humans and an agent in this respect? What makes you think this has any relevance to the problem? everything is combinatorially explosive: the combination of words that we can string into sentences and essays is also combinatorially explosive and yet LLMs and humans have no problem with it. It's just the wrong frame of thinking for what's going on. These systems are obtaining higher and higher levels of abstractions because that is the most efficient thing for them to do to gain performance. That's what reasoning looks like: compositions of higher level abstractions. What you say may be true but I don't see how this is relevant.

"There is no sense in which they are mathematically destined to eventually program well"

- Yes there is and this belies and ignorance of the literature and how things work

- Again: RL has been around forever. Scaling laws have held empirically up to the largest scales we've tested. There are known RL scaling laws for both training and test time. It's ludicrous to state there is "no sense" in this, on the contrary, the burden of proof of this is squarely on yourself because this has already been studied and indeed is the primary reason why we're able to secure the eye-popping funding: contrary to popular HN belief, a trillion dollars of CapEx spend is based on rational evidence-based decision making.

> "How good are these types of algorithms at generalization"

There is a tremendously large literature and history of this. ULMFiT, BERT ==> NLP task generalization; https://arxiv.org/abs/2206.07682 ==> emergent capabilities, https://transformer-circuits.pub/2022/in-context-learning-an... ==> demonstrated circuits for in context learning as a mechanism for generalization, https://arxiv.org/abs/2408.10914 + https://arxiv.org/html/2409.04556v1 ==> code training produces downstream performance improvements on other tasks

> Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."

You say this and ignore my entire argument: you are right about all of your observations, yet

- Opus 4.6 compared to Sonnet 3.x is clearly more generalizable and less prone to these mistakes

- Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop and our recursive improvement loop will die off. Verifiable domains mean that we are in alphago land, we're learning by doing and not by mimicking human data or memorizing a training set.

sobellian 7 hours ago
Hey man, it sounds like you're getting frustrated. I'm not ignoring anything; let's have a reasonable discussion without calling each other ignorant. I don't dispute the value of these tools nor that they're improving. But the no free lunch theorem is inexorable so the question is where this improvement breaks down - before or beyond human performance on programming problems specifically.
What difference do I think there is between humans and an agent? They use different heuristics, clearly. Different heuristics are valuable on different search problems. It's really that simple.
To be clear, I'm not calling either superior. I use agents every day. But I have noticed that claude, a SOTA model, makes basic logic errors. Isn't that interesting? It has access to the complete compendium of human knowledge and can code all sorts of things in seconds that require my trawling through endless documentation. But sometimes it forgets that to do dirty tracking on a pure function's output, it needs to dirty-track the function's inputs.
It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?
- aspenmartin 7 hours ago
  
  What is unreasonable? I am saying the claims you are making are completely contradicted by the literature. I am calling you ignorant in the technical sense, not dumb or unintelligent, and I don't mean this as an insult. I am completely ignorant of many things, we all are.
  I am saying you are absolutely right that Opus 4.6 is both SOTA and also colossally terrible in even surprisingly mundane contexts. But that is just not relevant to the argument you are making which is that there is some fundamental limitation. There is of course always a fundamental limitation to everything, but what we're getting at is where that fundamental limitation is and we are not yet even beginning to see it. Combinatorics here is the wrong lens to look at this, because it's not doing a search over the full combinatoric space, as is the case with us. There are plenty of efficient search "heuristics" as you call them.
  > They use different heuristics, clearly.
  what is the evidence for this? I don't see that as true, take for instance: https://www.nature.com/articles/s42256-025-01072-0
  > It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?
  It's a long known problem with RL in a particular regime and isn't relevant to coding agents. Things like Nim are a small, adversarially structured task family and it's not representative of language / coding / real-world tasks. Nim is almost the worst possible case, the optimal optimal policy is a brittle, discontinuous function.
  Alphago is pure RL from scratch, this is quite challenging, inefficient, and unstable, and why we dont do that with LLMs, we pretrain them first. RL is not used to discover invariants (aspects of the problem that don't change when surface details change) from scratch in coding agents as they are in this example. Pretraining takes care of that and RL is used for refinement, so a completely different scenario where RL is well suited.
  
  1 reply →
troupo 7 hours ago
> So what do you think the difference is between humans and an agent in this respect?
Humans learn.
Agents regurgitate training data (and quality training data is increasingly hard to come by).
Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.
> Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop.
Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.
- aspenmartin 7 hours ago
  
  > Where did I say that? I didn’t even mention money, just the broader resource term. A lot of business are mostly running experiments if the current set of tooling can match the marketing (or the hype). They’re not building datacenters or running AI labs. Such experiments can’t run forever.
  I'm just going to ask that you read any of my other comments, this is not at all how coding agents work and seems to be the most common misunderstanding of HN users generally. It's tiring to refute it. RL in verifiable domains does not work like this.
  > Humans learn.
  Sigh, so do LLMs, in context.
  > Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.
  Literally benchmarks on this all over the place, I'm sure you follow them.
  > Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.
  and yet its not logarithmic? Consider data flywheel, consistent algorithmic improvements, synthetic data [basically: rejection sampling from a teacher model with a lot of test-time compute + high temperature],
  > For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.
  Benchmaxxing is for sure a real thing, not to mention even honest benchmarking is very difficult to do, but considering "all of the AI companies are just faking the performance data" to be the "story" is tremendously wrong. Consider AIME performance on 2025 (uncontaminated data), the fact that companies have a _deep incentive_ to genuinely improve their models (and then of course market it as hard as possible, thats a given). People will experiment with different models, and no benchmaxxing is going to fool people for very long.
  If you think Opus 4.6 compared to Sonnet 3.x is "little progress" I think we're beyond the point of logical argument.