Comment by rdsubhas

8 hours ago

IMHO, it's not the oneshotting.

It's the "starting from empty slate" greenfield that's the real problem.

We used to make fun of engineers who follow a README for a framework, test it on an empty project, and say "this framework is the best for our 10-year-running production app". Greenfield mentality is always the solution to all problems and the problem to all solutions.

One should still measure oneshotting, since it's an important self-measurement metric, but against an established, large codebase.

I think this (for me at least) is the biggest pain point. Use styles and practices from the existing codebase, even if they aren't documented explicitly in AGENTS.md or something. If we're already importing a library somewhere that does what the agent is doing, reuse that same library; don't choose another one. If we have a pattern for unit tests, follow the same style. Etc. etc.

That issue, and the issue of "aesthetics", are the biggest complaints I have today. I don't know exactly how to define aesthetics, but it's when AI makes decisions that no experienced developer or designer would. The results may be functionally correct but "ugly" to another developer or an end user.

An example is a case I ran into yesterday involving parsing a config and failing with a logged message on a configuration error. It logged the specific item where the config was invalid, but not which group it belonged to or any indication of where in the config the error was. Of course, specific item names can be duplicated in different parts of the config. It's small, but correcting these minor things takes time, and they are the kinds of decisions that no one with any experience writing code and debugging a config problem would have made. This was Opus 4.8/max too.
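To make the config example concrete, here is a minimal Python sketch of the kind of error reporting I'd expect. It is not the code from that incident: the function name `find_invalid_ports`, the `port` rule, and the sample config are all hypothetical. The only point is that each message carries the full path to the offending item, not just its name.

```python
import logging
from typing import Any, Iterator, Tuple


def find_invalid_ports(node: Any, path: Tuple[str, ...] = ()) -> Iterator[str]:
    """Walk a nested config and yield one message per invalid 'port' value.

    Hypothetical rule for illustration: a port must be an int in 1..65535.
    Every message includes the dotted path, so duplicate item names in
    different groups can be told apart.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = path + (str(key),)
            if key == "port":
                if not (type(value) is int and 0 < value < 65536):
                    yield f"invalid value {value!r} at {'.'.join(child_path)}"
            else:
                yield from find_invalid_ports(value, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_invalid_ports(item, path + (str(index),))


# Sample config (made up): 'port' appears in several groups.
config = {
    "database": {"port": 5432},
    "cache": {"port": "abc"},
    "metrics": {"exporter": {"port": 0}},
}

for message in find_invalid_ports(config):
    logging.error(message)
```

With a message like "invalid value 'abc' at cache.port" you know which `port` to fix without grepping the whole file, which is the bit of context the agent left out.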

There are upcoming benchmarks aimed at measuring the ability to work on brownfield tasks. (Of course, benchmarks can be gamed, but they are still better than the unrealistic toy tasks that earlier generations of benchmarks used. Frontier labs have yet to use them in their tech reports or marketing material, though. :-)

* SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470
* SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823

At least they did some analysis. I've seen a couple of AI slop "X is the best tool for the job" recommendations from people who didn't even try it. (Worse, we are already using Qt, which has a tool for the job, and the Qt tool works with the rest of the Qt ecosystem, unlike whatever the AI told them.)