
Comment by embedding-shape

11 hours ago

> - coding is a verifiable domain

You're missing the point, though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but that's missing the forest for the trees: how do you know which one is the long-term right choice, given what we know? That's the engineering part of building software, and it's less about "coding", which tends to be the easy part.

How do you evaluate, score and/or benchmark something like that? Currently, I don't think we have any methodologies for this, probably because it's pretty subjective in the end. That's where the "creative" parts of software engineering become more important, and they're also way harder to verify.

While I agree we don't have any methodologies for this, it's also true that we can just "fail" more often.

Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.

I wouldn't be surprised if in a couple of years we see several projects that approach the problem of tech debt like this:

1. Instruct an AI to write tens of thousands of tests using available information: documentation, requirements, meeting transcripts, etc. These tests MUST include performance AND availability tests (along with other "quality attribute" concerns).
2. Have humans verify (to the best of their ability) that the tests are correct -- this step is likely optional.
3. Ask another AI to re-implement the project while matching the tests (sketched below).
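As a rough sketch of what that loop could look like (the `generate` helper below is a hypothetical stand-in for whatever agent API you use, not a real provider call):

```python
# Hypothetical sketch of the tests-first regeneration workflow above.
from pathlib import Path
import subprocess

def generate(prompt: str, context: list[Path]) -> str:
    """Hypothetical agent call: returns code/text for the given prompt."""
    raise NotImplementedError("wire up your agent provider here")

def regenerate_project(sources: list[Path], workdir: Path,
                       max_attempts: int = 10) -> bool:
    # Step 1: derive tests from docs, requirements, transcripts, etc.
    tests = generate(
        "Write an exhaustive pytest suite, including performance and "
        "availability tests, from the attached material.",
        context=sources,
    )
    (workdir / "test_suite.py").write_text(tests)

    # Step 2 (optional): a human reviews the suite before continuing.
    input("Review test_suite.py, then press Enter to continue...")

    # Step 3: a fresh agent re-implements until the suite passes.
    for _ in range(max_attempts):
        impl = generate("Implement the project so that test_suite.py passes.",
                        context=[workdir / "test_suite.py"])
        (workdir / "impl.py").write_text(impl)
        if subprocess.run(["pytest", str(workdir)]).returncode == 0:
            return True
    return False
```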

It sounds insane, but...not so insane if you think we will soon have models better than Opus 4.6. And given the things I've personally done with it, I find it less insane as the days go by.

I do agree with the original poster who said that software is moving in this direction, where super fast iteration happens and non-developers can quickly get features in front of them, at least as a demo. I think it clearly is, and I'm working internally to make this a reality: you submit a feature request, and eventually a live demo is ready for you, deployed in isolation on some internal server, proxied appropriately if you need a URL, ready for you to give feedback and have the AI iterate on it. It works for the kind of projects we have, and though I get it might be trickier for much larger systems, I'm sure everyone will find a way.
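The plumbing for this doesn't have to be exotic. A minimal sketch, where `build_demo_with_agent` is a hypothetical placeholder for the agent integration and the port-per-demo scheme is just one arbitrary way to isolate deployments:

```python
# Hypothetical sketch of a feature-request -> isolated live demo pipeline.
import socket
import subprocess
from pathlib import Path

def free_port() -> int:
    # Ask the OS for an unused port so each demo runs in isolation.
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def build_demo_with_agent(feature_request: str, workdir: Path) -> None:
    """Placeholder: have an agent scaffold a runnable app in workdir."""
    raise NotImplementedError("wire up your agent provider here")

def deploy_demo(feature_request: str) -> str:
    workdir = Path("/tmp/demos") / str(abs(hash(feature_request)))
    workdir.mkdir(parents=True, exist_ok=True)
    build_demo_with_agent(feature_request, workdir)
    port = free_port()
    # Serve the generated app on its own port; a reverse proxy (nginx,
    # Caddy, ...) can then map it to an internal URL for feedback.
    subprocess.Popen(["python", "-m", "http.server", str(port)], cwd=workdir)
    return f"http://internal-demo-host:{port}/"
```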

For now, we still need engineers to help drive many decisions, and I think that'll still be the case. These days, all I do when "coding" is talk (via TTS) with Opus 4.6, iterating on several plans until we get the right one, and I can't wait to see how much better this workflow will be with smarter and faster models.

I'm personally trying to adapt everything in our company to have agents work with our code in the most frictionless way we can think of.

Nonetheless, I do think engineers with a product inclination are better off than those who are mostly all about coding and building systems. To me, it has never felt so magical to build a product, and I'm loving it.

  • > Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.

    I'm sorry, but only someone who has never maintained software long-term would say something like this. The further along you are in development, the more the cost of changing it increases, maybe even exponentially.

    Correcting the design before you've even written code might be 100x (or even 1000x) cheaper than changing that design 2 years later, after you've stored TBs of data in some format because of that decision, and lots of other parts of the company/product/project depend on the choices you made earlier.

    You can't just pile code on top of code, say "code is cheap" and hope for the best; it's just not feasible to run a project long-term that way. I think if you had the experience of maintaining something long-term, you'd realize how this sounds.

    The easiest part of "software engineering" is "writing code", and today "writing code" is even easier. But the hardest parts, actually designing, thinking and maintaining, remain largely the same as before; some parts are easier, others are harder.

    Don't get me wrong, I'm on the "agentic coding" train as much as everyone else, and probably haven't written/edited code by myself for a year at this point, but it's important to be realistic about what it actually takes to produce "worthwhile software", not just slop out patchy and hacky code.

    • I've never maintained software long-term so I could be wrong, but I interpret "code is cheap" to mean that you can have coding agents refactor or rewrite the project from scratch around the design correction. I don't think "code is cheap" should ever be interpreted to mean ship hacky code.

      I think using agents to prototype code and design will be a big thing. Have the agent write out what you want, come back with what works and what doesn't, write a new spec, toss out the old code, and have a fresh agent start again. Spec-driven development is the new hotness, but we know that the best spec is code: have the agent write the spec in code, rewrite the spec in natural language, then iterate.
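      One way that loop could look, as a rough sketch; the `agent` function here is a hypothetical placeholder for whatever LLM API you use, not a real call:

      ```python
      # Sketch of the prototype -> spec -> rewrite loop. `agent` is a
      # hypothetical placeholder, not a real library call.
      def agent(prompt: str, context: str = "") -> str:
          raise NotImplementedError("wire up your LLM provider here")

      def iterate_on_spec(idea: str, rounds: int = 3) -> str:
          # First pass: get something working to learn what you actually want.
          code = agent(f"Prototype this: {idea}")
          for _ in range(rounds):
              # The best spec is code: distill the working prototype back into
              # natural language, then have a fresh agent rebuild from scratch.
              spec = agent("Rewrite this prototype as a natural-language spec.",
                           context=code)
              code = agent(f"Implement this spec from scratch:\n{spec}")
          return code
      ```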

because it has business context and better reasoning, and can ask humans for clarification and take direction.

You don't need to benchmark this, although it's important. We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.

I do benchmarks for a living and can attest: benchmarks are bad, but it doesn't matter for the point I'm trying to make.

  • I feel like you're missing the initial context of this conversation (no pun intended):

    > Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent.

    Once you add "humans for clarification and take direction", then yeah, things can be useful, but that's far away from the no-human-involvement loop described earlier in this thread, which is what people are pushing back against.

    Of course involving people makes things better; that's the entire point here: by removing the human, you won't get results that are as good. Going back to benchmarks: involving humans obviously isn't possible there, so again we're back to being unable to score these processes at all.

    • I'm confused about the scenario here. There is a human in the loop: it's the feedback part. There is business context: it is either seeded or maintained by the human and expanded by the agent. The agent can make inferences about the world, especially when embodiment + better multimodal interaction are rolled out [embodiment taking longer].

      Benchmarks ==> it's absolutely not a given that humans can't be involved in the loop of performance measurement. Why would that be the case?

  • > because it has business context

    It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context

    > better reasoning

    It doesn't have better reasoning beyond very localized decisions.

    > and can ask humans for clarification and take direction.

    And yet it doesn't, no matter how many .md files you throw at it, at crucial places in the code.

    > We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.

    This is just a bunch of words strung together, isn't it?

    • > It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context

      It does learn in context. And the lack of continuous learning is temporary; it's a quirk of the current stack, so expect this to change rather quickly. It's also not really relevant: consider that agentic systems can be hierarchical, and that they have no trouble grokking codebases or doing internal searches effectively, and this will only improve.
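      To be concrete about "learns in context": the usual workaround for statelessness is to persist notes and prepend them to every run. A minimal sketch, where the `agent` call and the file name are hypothetical:

      ```python
      # Minimal sketch of context persistence across stateless agent runs.
      # `agent` and `business_context.md` are hypothetical placeholders.
      from pathlib import Path

      MEMORY = Path("business_context.md")

      def agent(prompt: str) -> str:
          raise NotImplementedError("wire up your LLM provider here")

      def run_with_memory(task: str) -> str:
          # Prepend everything remembered so far to this run's prompt.
          context = MEMORY.read_text() if MEMORY.exists() else ""
          answer = agent(f"{context}\n\nTask: {task}")
          # Distill anything worth keeping, so the next (stateless) run
          # starts from accumulated knowledge instead of a blank slate.
          notes = agent(f"Summarize new facts/decisions from:\n{answer}")
          MEMORY.write_text(f"{context}\n{notes}")
          return answer
      ```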

      > It doesn't have better reasoning beyond very localized decisions.

      Do you have any basis for this claim? It contradicts a large amount of direct evidence, measurement, and theory.

      > This is just a bunch of words strung together, isn't it?

      Maybe to yourself? Chinchilla scaling laws and RL scaling laws are measured very accurately, based on next-token test loss, and that scales very predictably. It is related to downstream performance; that relationship is noisy but clearly monotonic.
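      To unpack that a bit, here's a small worked example using the parametric Chinchilla fit from Hoffmann et al. (2022). The coefficients are the paper's published fit; the (N, D) budgets are just illustrative:

      ```python
      # Chinchilla parametric loss: L(N, D) = E + A/N^alpha + B/D^beta,
      # predicting next-token test loss from parameters N and tokens D.
      # Coefficients are the published Hoffmann et al. (2022) fit.
      def chinchilla_loss(N: float, D: float) -> float:
          E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
          return E + A / N**alpha + B / D**beta

      # Loss falls smoothly and predictably as model and data scale up:
      for N, D in [(1e9, 20e9), (10e9, 200e9), (70e9, 1.4e12)]:
          print(f"N={N:.0e}, D={D:.0e} -> loss {chinchilla_loss(N, D):.3f}")
      ```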


    • Almost every task people are throwing agents at is either not worth doing, can be done better with scripts and software, or requires human oversight (which negates all the advantages).
