Comment by aspenmartin

11 hours ago

I think this sounds like a true yet short sighted take. Keep in mind these features are immature but they exist to obtain a flywheel and corner the market. I don’t know why but people seem to consistently miss two points and their implications

- performance is continuing to increase incredibly quickly, even if you rightfully don’t trust a particular evaluation. Scaling laws like chinchilla and RL scaling laws (both training and test time)

- coding is a verifiable domain

The second one is most important. Agent quality is NOT limited by human code in the training set, this code is simply used for efficiency: it gets you to a good starting point for RL.

Claiming that things will not reach superhuman performance, INCLUDING all end to end tasks: understanding a vague business objective poorly articulated, architecting a system, building it out, testing it, maintaining it, fixing bugs, adding features, refactoring, etc. is what requires the burden of proof because we literally can predict performance (albeit it has a complicated relationship with benchmarks and real world performance).

Yes definitely, error rates are too high so far for this to be totally trusted end to end but the error rates are improving consistently, and this is what explains the METR time horizon benchmark.

31 comments

aspenmartin

sobellian 11 hours ago

Scaling laws vs combinatorial explosion, who wins? In personal experience claude does exceedingly well on mundane code (do a migration, add a field, wire up this UI) and quite poorly on code that has likely never been written (even if it is logically simple for a human). The question is whether this is a quantitative or qualitative barrier.

Of course it's still valuable. A real app has plenty of mundane code despite our field's best efforts.

aspenmartin 10 hours ago
Combinatorial explosion? What do you mean? Again, your experiences are true, but they are improving with each release. The error rate on tasks continues to go down, even novel tasks (as far as we can measure them). Again this is where verifiable domains come in -- whatever problems you can specify the model will improve on them, and this improvement will result in better generalization, and improvements on unseen tasks. This is what I mean by taking your observations of today, ignoring the rate of progress that got us here and the known scaling laws, and then just asserting there will be some fundamental limitation. My point is while this idea may be common, it is not at all supported by literature and the mathematics.
- sobellian 10 hours ago
  
  The space of programs is incomprehensibly massive. Searching for a program that does what you need is a particularly difficult search problem. In the general case you can't solve search, there's no free lunch. Even scaling laws must bow to NFL. But depending on the type of search problem some heuristics can do well. We know human brains have a heuristic that can program (maybe not particularly well, but passably). To evaluate these agents we can only look at it experimentally, there is no sense in which they are mathematically destined to eventually program well.
  How good are these types of algorithms at generalization? Are they learning how to code; or are they learning how to code migrations, then learning how to code caches, then learning how to code a command line arg parser, etc?
  Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with claude you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and at long last if you turn it away from all of these it may say something like "honest answer - this feature is really difficult and we should consider a compromise."
  
  6 replies →

embedding-shape 11 hours ago

> - coding is a verifiable domain

You're missing the point though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but it's missing the forest for the trees, how do you know which one is "long-term the right choice, given what we know?", which is the engineering part of building software, and less about "coding" which tends to be the easy part.

How do you evaluate, score and/or benchmark something like that? Currently, I don't think we have any methodologies for this, probably because it's pretty subjective in the end. That's where the "creative" parts of software engineering becomes more important, and it's also way harder to verify.

jorl17 9 hours ago
While I agree we don't have any methodologies for this, it's also true that we can just "fail" more often.
Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.
I wouldn't be surprised if in a couple of years we see several projects that approach the problem of tech debt like this:
1. Instruct AI to write tens of thousands of tests by using available information, documentation, requirements, meeting transcripts, etc. These tests MUST include performance AND availability related tests (along with other "quality attribute" concerns) 2. Have humans verify (to the best of their ability) that the tests are correct -- step likely optional 3. Ask another AI to re-implement the project while matching the tests
It sounds insane, but...not so insane if you think we will soon have models better than Opus 4.6. And given the things I've personally done with it, I find it less insane as the days go by.
I do agree with the original poster who said that software is moving in this direction, where super fast iteration happens and non-developers can get features to at least be a demo in front of them fast. I think it clearly is and am working internally to make this a reality. You submit a feature request and eventually a live demo is ready for you, deployed in isolation at some internal server, proxied appropriately if you need a URL, and ready for you to give feedback and have the AI iterate on it. Works for the kind of projects we have, and, though I get it might be trickier for much larger systems, I'm sure everyone will find a way.
For now, we still need engineers to help drive many decisions, and I think that'll still be the case.These days all I do when "coding" is talking (via TTS) with Opus 4.6 and iterating on several plans until we get the right one, and I can't wait to see how much better this workflow will be with smarter and faster models.
I'm personally trying to adapt everything in our company to have agents work with our code in the most frictionless way we can think of.
Nonetheless, I do think engineers with a product inclination are better off than those who are mostly all about coding and building systems. To me, it has never felt so magical to build a product, and I'm loving it.
- embedding-shape 9 hours ago
  
  > Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.
  I'm sorry, but only someone who never maintained software long-term would say something like this. The further along you are in development, the magnitude of costs related to changing that increases, maybe even exponentially.
  Correct the design before you even wrote code, might be 100x cheaper (or even 1000x) than changing that design 2 years later, after you've stored TBs of data in some format because of that decision, and lots of other parts of the company/product/project depends on those choices you made earlier.
  You can't just pile on code on top of code, say "code is cheap" and hope for the best, it's just not feasible to run a project long-term that way, and I think if you had the experience of maintaining something long-term, you'd realize how this sounds.
  The easiest part of "software engineering" is "writing code", and today "writing code" is even easier. But the hardest parts, actually designing, thinking and maintaining, remains the same as before, although some parts are easier, others are harder.
  Don't get me wrong, I'm on the "agentic coding" train as much as everyone else, probably haven't written/edited a code by myself for a year at this point, but it's important to be realistic about what it actually takes to produce "worthwhile software", not just slop out patchy and hacky code.
  
  1 reply →
aspenmartin 11 hours ago
because it has business context and better reasoning, and can ask humans for clarification and take direction.
You don't need to benchmark this, although it's important. We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.
I do benchmarks for a living and can attest: benchmarks are bad, but it doesn't matter for the point I'm trying to make.
- embedding-shape 10 hours ago
  
  I feel like you're missing the initial context of this conversation (no pun intended):
  > Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent.
  Once you add "humans for clarifications and take direction" then yeah, things can be useful, but that's far away from the non-human-involvment-loop earlier described in this thread, which is what people are pushing back against.
  Of course, involving people makes things better, that's the entire point here, and that by removing the human, you won't get as good results. Going back to benchmarks, obviously involving humans aren't possible here, so again we're back to being unable to score these processes at all.
  
  1 reply →
- troupo 10 hours ago
  
  > because it has business context
  It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context
  > better reasoning
  It doesn't have better reasoning beyond very localized decisions.
  > and can ask humans for clarification and take direction.
  And yet it doesn't, no matter how many .md file you throw at it, at crucial places in code.
  > We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.
  This is just a bunch of words stringed together, isn't it?
  
  8 replies →

nprateem 11 hours ago

But the issue isn't coding, it's doing the right thing. I don't see anywhere in your plan some way of staying aligned to core business strategy, forethought, etc.

The number of devs will reduce but there will still be large activities that can't be farmed out without an overall strategy

aspenmartin 11 hours ago
Why do you think this is a problem? Reasoning is constantly improving, it has ample access to humans to gather more business context, it has access to the same industry data and other signals that humans do, and it can get any data necessary. It has Zoom meeting notes, I mean why do people think there's somehow a fundamental limit beyond coding?
The other thing you're missing here is generalizability. Better coding performance (which is verifiable and not limited by human data quality) generalizes performance on other benchmarks. This is a long known phenomenon.
- skydhash 9 hours ago
  
  > Why do you think this is a problem?
  Because it cannot do it?
  Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).
  You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.
  
  3 replies →