Comment by cultofmetatron
11 hours ago
I seriously don't get all this big hullabaloo about one-shot prompting.
By definition, a single prompt won't constitute the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.
I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.
I'd also like to see it identify bugs and potential performance increases by examining existing code and suggesting refactors based on the context it can pick up about the particular use case you are trying to create.
These are way more valuable metrics than "hey build X"
The streetlight effect:
> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is"
All of your suggestions are better but they're hard, so someone casually evaluating an AI isn't going to do them.
Sure, for casual evaluation, I agree. But are there serious analyses that are evaluating this kind of thing? I mean, these are the kinds of things I evaluate in my own work when a new model comes out, or when I'm evaluating a harness. But this is all very ad hoc and intuitional. I'd love to start bringing rigor to it, but I haven't found much prior art on this. In another thread someone said that's because it's probably impossible to do this rigorously because too much of it is subjective. And that does match my intuition. But I continue to suspect that intuition is wrong.
It's hard to bring much rigor to it. I'm not saying impossible, but it's not like it's completely obvious how to do it and people are just too lazy. Intrinsically, if I'm going to test a back-and-forth with a model I have a human in the loop making frequent decisions. Did the model fail or succeed at whatever rate it did that because of the model or the human? Did the testing protocols capture the actual problem, e.g., maybe if the model was given some particular bit of information that a normal human would have given it it would have done much better or worse, but the testing protocol in the interests of "rigor" excluded the human in the loop from doing it. Is the human going to be willing to sit down and do the same task 25 times, refreshing the model from scratch each time for a "valid" test? Can you get the same human to analyze every model in the test? Is their 10th pass of the problem an invalid test because you can't as easily erase the human's knowledge of the previous 9 tests? What do you do with a model that succeeds wildly 75% of the time and spins off into a loop the other 25%? Is that loop real or, again, did your "rigorous" testing protocol prevent the human from saving the model from the loop like any developer would?
And so on and so forth. Again, I'm not saying this is impossible but I am saying that if you tried to do it, and you got the money, and you built the test, and got the human subjects clearance, and you ignored that during the process of all that at least one more frontier model would come out, you can count on HN anklebiting your "rigorous" study even so, and probably being correct about a lot of the issues it could have because it would take several iterations of this to build a reasonable protocol... at which point it would quite possibly also be obsoleted by progress again.
You usually see this kind of analysis in conference papers, especially at conferences with a datasets track. The NeurIPS Datasets & Benchmarks (D&B) track is a good example. But you will have to monitor the proceedings closely yourself. There is little chance of being accidentally exposed to them, because most blogs, announcements and popular media only mention a handful of the popular ones, e.g., Tau^2. For example, across the years 2022, 2023 and 2024, 900+ papers were accepted in the D&B track [1]. Of course, not all of them are LLM-related. I find them interesting because they often focus on specific system behaviors and, like you said, study them scientifically, so you can draw authoritative conclusions (or at least know specifically which parts of a model's behavior you now know about, and which parts you don't).
[1] https://blog.neurips.cc/2025/09/30/reflecting-on-the-2025-re...
DeepSWE is closer to that
https://deepswe.datacurve.ai/
The minute an open model breaks through and beats Claude Opus/Fable, it's over.
There are far more opportunities that can be served when the world's intellectuals have the raw weights and can fine tune, splice, distill, and reapply.
Imagine having raw unfettered access to Fable. It can be refit to structural biology. It can be fine tuned on the repo for smaller context requirements. It can be run cheaper and air gapped.
The world wants this.
I don’t think we need them. I think the models we have are good enough. It’s the orchestration layer that makes the biggest difference at this point. The open source models we have are capable of calling tools and the work is getting them to be capable enough to know which tools to call and what to do in response.
I think we are leaving the mainframe era of AI and entering the PC era already. If there weren’t a RAM shortage and we all had 2TB of RAM and GPUs, we would all have large local models or personal APIs serving our teams.
That’s why all the labs are moving to the App layer and moving away from being the API for intelligence like they were originally.
As crazy as this sounds, and as much I don't want to believe it myself, I think we're still underestimating LLMs, and we're gonna get to that point pretty soon.
The world does want this. Opus capabilities, in a box, securely tunneled to my family and me, using the resources I already have available: energy + network.
This kind of hamfisted snark tends to make people take the actual and justified criticism of police less seriously.
It could be a taxi driver if you like. Or an anarchist passing by on xir way to a protest.
…in the US.
One-shot performance often translates to the most difficult problems a model will be able to understand. We run an evaluation that tests both agentic and one-shot performance, and we find that Chinese models are almost universally very good at using tools and a harness to iterate towards a better solution, whereas their initial response ranks relatively low.
Compare that to Gemini models, which have impressive fluid intelligence on the first response, but fail to call tools or explore correctly which limits their usefulness for agentic coding.
Neither will be great for coding in a computational chemistry repo for different reasons, but the model with strong one-shot performance will be less likely to make subtle errors indicative of poor understanding, so we weight both capabilities into their final score.
The latest Anthropic and OpenAI models excel in both domains.
Data at https://gertlabs.com/rankings
IMHO, it's not the oneshotting.
It's the "starting from empty slate" greenfield that's the real problem.
We used to make fun of Engineers who follow a README on a framework, test it on an empty project, and say "this framework is the best for our 10 year running production app". Greenfield mentality is always the solution to all problems and problem to all solutions.
One should still measure oneshotting, it's an important self-measurement metric - but against an established, large codebase.
There are upcoming benchmarks aimed at measuring the ability to work with brownfield tasks. (Of course, benchmarks can be gamed, but they are still better than the unrealistic toy tasks that earlier generations of benchmarks used. Frontier labs have yet to use them in their tech reports or marketing material, though. :-)
* SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470
* SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823
At least they did some analysis. I've seen a couple of AI-slop "X is the best tool for the job" claims from people who didn't even try it. (Worse, we are already using Qt, which has a tool for the job, and the Qt tool works with the rest of the Qt ecosystem, unlike whatever AI told them.)
It's a proxy for what you actually want to measure.
Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives.
And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context.
At least everyone knows what a 3D game is.
As someone who works at $corp: there is a massive difference between tickets. I've seen "'The' is not spelled 'teh'", and I've seen "some other service is writing to memory, causing a crash in my service" (the latter took months to track down, since our code was correct and nothing gave a hint of where to look). Both problems are important to fix, but the first is so simple I don't care how good AI is (the hard part is getting it through the process).
What are you yapping about? This was not one-shot prompting, but a long-horizon task. GLM and Opus invoked 120+ tools across the runs.
I guess the experiment is interesting to determine if a model can produce something subjectively valued as "good" based on fairly vague and open-ended specifications. The benchmark is not to determine if the output fits the input, but whether the output is internally consistent: it's a game, but does it behave as one would expect any game to behave? Does it end when you reach the goal, do you die when hitting the spikes, are there weird edge cases in behavior when you move around?
I think however that they should have used the same harness and also repeated the experiment a few times to judge the variance in results.
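To make "repeat the experiment and judge the variance" concrete, here is a minimal sketch in Python. It assumes each run has been given a numeric rubric score; the function names and the scores below are hypothetical illustrations, not data from the experiment, and the overlap test is a crude heuristic rather than a proper significance test.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for repeated runs of one model."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)

def clearly_different(a, b):
    """Crude check: is the gap between means larger than the two spreads combined?"""
    mean_a, sd_a = summarize_runs(a)
    mean_b, sd_b = summarize_runs(b)
    return abs(mean_a - mean_b) > (sd_a + sd_b)

# Hypothetical rubric scores (0-10) from five independent runs per model.
model_a = [7, 8, 6, 9, 7]
model_b = [6, 9, 5, 8, 7]

if __name__ == "__main__":
    print(summarize_runs(model_a))
    print(summarize_runs(model_b))
    print(clearly_different(model_a, model_b))
```

With run-to-run spread this large, a single head-to-head comparison would say little about which model is better, which is the argument for repeating the runs.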
> I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
In fact, I'd rather see Anthropic publish a convincing project that does this using Claude. The project should be complex enough and novel enough to show the world how reliable and powerful Claude is. That is, Anthropic does not need Amodei or its employees to tell us that whatever percent of engineers will lose their jobs. They can just show us. Easily.
It's true that no one is trying to one shot anything serious right now, but it's still an important metric. Claude Code and Opus really took off when they improved the harnessing enough that it would self-correct many of its mistakes without needing user input. In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.
> In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.
Right, model intelligence defines the scope of things they can one shot
I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before
It won't happen; it's all a money grab.
I think that LLMs will stay, but I also think we've plateaued, that big companies will fail and fall, and that we will have another years-long "halt" of any real advancements coming to the public.
Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.
Unless I'm missing something, the prompt he gave must have been fairly detailed because both games are basically identical.
But for a more practical issue, the ultimate goal of LLMs is to replace software engineers, or at least enable everybody to become a software engineer, to use a more up-beat phrasing that's no less accurate. And so an LLM's ability to reliably construct something from a poorly defined, contradictory, or otherwise flawed prompt, while accurately inferring intent is probably the first finish line.
More likely is the models were trained on similar data.
I feel like on HN there is an endless cycle:
- Vibes are too subjective, I want an actual A/B test!
- An A/B test is too limited, I want a benchmark! (You are here.)
- Those benchmarks never seem to be reliable, I just go on vibes.
Exactly this. I recently tried Claude Code again to get the subsidy on Fable rather than paying API prices, and was so frustrated by how much it pushed autonomous behavior. It would start ignoring my planning documents, ignoring my coding conventions, and reimplementing features and code already in the project (not sure it ever makes sense to have two auth systems in parallel or two websocket implementations for the same UI), and then, in the most shocking interaction, it just refused to stop working and listen to my instructions. I think maybe it was because there was a subagent doing the work, but it was a complete waste of time and effort.
I was using Cursor, in large part because I could at least stop it when I needed to.
I ended up building my own IDE from scratch so I can be more in the loop while also having the full agent experience.
Isn't a plan file just a single long prompt?
> I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.
Let's say you have a table that is partitioned. How do you lint/format "any select into this table MUST include the partition key in the predicate and any join must include it in the on"? I'm not personally familiar with any static analysis tool that does this, but it's trivial to implement with an LLM prompt, and trivially easy to add to your automated PR reviews.
I would tell the LLM to write a custom rule/check for whatever the scenario is. Then when the CI gate is run, all my custom checks get deterministically run.
Elixir is where I prefer to build software, so it would be creating a custom Credo rule.
https://github.com/rrrene/credo
https://credo.hexdocs.pm/adding_checks.html
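For illustration only, here is roughly what a deterministic custom check for the partition-key scenario could look like. This is a Python sketch, not a Credo rule, and it is regex-based rather than a real SQL parser. The table name `events` and key `event_date` are hypothetical, and the check only verifies that the key appears somewhere after a WHERE or ON; it does not verify each join's ON clause separately.

```python
import re

# Hypothetical mapping: partitioned table name -> its partition key column.
PARTITIONED = {"events": "event_date"}

def missing_partition_key(sql):
    """Return the partitioned tables a query reads without mentioning their key."""
    text = sql.lower()
    problems = []
    for table, key in PARTITIONED.items():
        # Does the query read from or join against the partitioned table?
        uses_table = re.search(rf"\b(from|join)\s+{re.escape(table)}\b", text)
        # Crude: the key only has to appear somewhere after a WHERE or ON keyword.
        mentions_key = re.search(
            rf"\b(where|on)\b.*\b{re.escape(key)}\b", text, re.DOTALL
        )
        if uses_table and not mentions_key:
            problems.append(table)
    return problems

if __name__ == "__main__":
    print(missing_partition_key("SELECT * FROM events WHERE user_id = 1"))
    print(missing_partition_key("SELECT * FROM events WHERE event_date = '2024-01-01'"))
```

A check like this could run in a CI gate over the SQL strings in a change, so the result is the same on every run. A production version would want a real SQL parser instead of regexes.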
It's not always possible, or at least trivial. For example how do you enforce "prefer to reuse existing code over making a copy"? Is there a static analysis tool that will detect two pieces of code that do the same thing?
Yes it’s possible:
https://github.com/elixir-vibe/ex_dna
Wrong, custom "specs" i.e. schemas, are literally all we have for "real" guardrails with LLMs.
https://developers.openai.com/api/docs/guides/structured-out...
Nothing else operates on the logprobs level and literally bans continuations that fail your schema.
Enforcing structured outputs from LLMs is not the same thing as using linters, formatters, static analysis to control how an agent writes code.
> I seriously don't get all this big hullabaloo about one-shot prompting.
It's a relatively objective way of testing LLMs, and I think it's pretty representative of how strong models are overall.
The outcome of this test mirrors how GLM 5.2 and Opus 4.8 work for me: they're both similarly capable of fully executing a given task, but Opus tends to have a bit more "taste" in how it handles unstated details or implicit requirements.
> what you'll get is a series of assumptions made by the model
Yes, but that's why we use these models in the first place. We don't want to explicitly write down all the details because that would mean writing code. So we write a higher-level, human-language spec, and let the LLM fill in the blanks. The question is how good they are at doing that.
One shotting is useful to test but only with a huge prompt (eg, build something according to this spec).
I agree generating millions of tokens from a handful of input tokens doesn't convey anything meaningful to me.
If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention, we can pretty easily decide how well the model does overall. And judging better models just means adding more requirements to a task. So I think it's a useful method (even if it's not a realistic use case).
Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)
You seem to be missing the point of what parent is saying :)
To really evaluate how a model is to use in real life, it should have access to tools, and be able to iterate on something, like they do when you use them in an agent harness.
None of that iteration necessarily needs a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture). You can just let the model do a couple of tries and give it input on how it's doing, and you get something closer to how people use these models in reality.
> If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention (...)
This is the wrong metric to target. Today's models can feel one-shot, but only at the expense of resilient ReAct loops that brute-force their way out of the mess the initial prompts created.
And each iteration is expensive.
Sometimes failing fast and early is better than going for one-shot models that try to mitigate the mess they created with reasoning steps and ReAct loops.
I think you're underestimating the elegance of "hey build X". It already captures a lot of what you're interested in.
Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.
Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.
Either people would not "get" it quickly enough, or they would disagree with or not be interested in the setup because it's not how they use LLMs.
On one hand, that's sort of true for practical uses - and benchmarks notoriously undercount multi-turn settings.
On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.
When the model produces reasonable results from one prompt, you could assume that it will also return reasonable results through the follow up prompts.
The argument is flawed: there is no logical reason to assume a single prompt won’t be sufficient to constitute the complexity of a software project. It may not be practical in many cases, but there is too much variability in what is considered a complex software project, and in the sufficiency of instruction in a single prompt, to make that claim and say it’s “by definition.”
And that prompt will basically be 2000 page spec Bible à la IBM circa 1960, see waterfall. Unless AI develops mindreading (and advanced mindreading at that), single prompt creation of actual complex software products will never happen. You'll one shot a simple non scientific calculator, but not Excel or Vim or Nginx.
Why not? Given a proper spec, you should absolutely be able to one-shot Excel, particularly if we put it at the level of complexity of, say, Excel 1.0 for Mac.
Current models aren't capable of that, but that doesn't mean it's not possible.
One-shot prompting/tooling is the only reasonable way to use an LLM, in my opinion. You should not have an LLM operating for hours, creating thousands of lines of new code that you can never review or maintain. You can actually be highly productive modifying a single file or two at a time, ideally with as focused and as little context as possible, without the LLM being given full permission to add as much context as possible along the way to maximize revenue for the developers of the harness.
The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too, causing addict-like behavior in people who fall prey to this trend.
I disagree fundamentally.
If I do that, I'm literally slower than just making the change myself, without having to specify it sufficiently for the model.
I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.
I say this because I've just had to face this decision this month, with Copilot introducing usage-based billing. I attempted to scale back my usage, first with non-Opus models: output essentially became discardable, as it continually hallucinated non-existent fields in API responses, etc. Then by scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.
> a single prompt won't constitute the complexity of a software project.
The top agent is for steering, but all subagents are mostly one-shot prompts.
I also love the term zero-shot in the AI benchmark world. So logical. So intuitive.........
"We did multi-shot prompting to try and get these two games into comparable states using these two different models."
"Well obviously you provided better follow-up prompts to the one that came out better."
Also, nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent. How can you provide a fair test, though, when the model's output n determines your n+1 response?
Sure, real-world usage is always more difficult to benchmark, but the additional issue with the one-shot prompting benchmark is that by optimizing for it, models are nudged towards making all those assumptions they shouldn't really make. Maybe a better test would be to have a fully spec'd-out plan, but start with a one-shot, high-level prompt and expect the agent to discover your preferences by repeatedly asking for clarifications. The system that manages to suss out more of the details in the hidden spec this way, in fewer steps and with fewer unnecessary questions, would more likely be a truly well-calibrated agent.
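As a rough sketch of how such a hidden-spec test might be scored (everything here is an assumption for illustration: the function name, the way questions are labeled, and the example details are all hypothetical), one could track how much of the hidden spec the agent's questions uncovered and how many questions were wasted:

```python
def score_clarification_run(hidden_spec, questions_asked):
    """Score one run of a hypothetical hidden-spec clarification benchmark.

    hidden_spec: set of detail names the evaluator kept hidden from the agent.
    questions_asked: list with one entry per question the agent asked, naming
        the hidden detail it targeted, or None if it targeted nothing in the spec.
    """
    if not hidden_spec:
        raise ValueError("hidden_spec must contain at least one detail")
    # Details the agent actually uncovered (repeat questions collapse here).
    discovered = {q for q in questions_asked if q in hidden_spec}
    coverage = len(discovered) / len(hidden_spec)
    # Irrelevant and repeated questions both count as wasted.
    wasted = len(questions_asked) - len(discovered)
    return {"coverage": coverage, "questions": len(questions_asked), "wasted": wasted}

if __name__ == "__main__":
    spec = {"auth", "db", "theme", "locale"}
    asked = ["auth", "db", None, "auth"]
    print(score_clarification_run(spec, asked))
```

Higher coverage with fewer total and wasted questions would indicate a better-calibrated agent under this scheme. Labeling which hidden detail each question targets is the hard, subjective part, and this sketch assumes it has already been done.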
Blame Anthropic; they decided to make these types of one-shot examples the primary focus of the Fable 5 release, relegating benchmark scores to the PDF.
PREACH. I have no idea why THIS has become the standard for illustrating model capabilities. It's endlessly frustrating, given that this was the initial objective for all these models but it became increasingly clear over time that none of them were ever capable of getting the desired output for complex software on the initial prompt.
The reality is:
- business rules change
- ideas for improvement may arise from the initial prompt
- updates to submodules/functions/configs/secrets are BLOCKERS
- ... etc.
One-shot prompting with the expectation of complete software seems more and more like a show of incompetence in the use of this technology. It's like trying to make my toddler eat a ham sandwich from the peanut butter & jelly I put in front of him.
That's precisely the difference between an engineer and a business guy.
The business guy would say "hey build me this and that" and would get _something_ to show off.
An engineer will have a long conversation with an LLM about the exact requirements, tech stack, and tradeoffs. He would understand what is built and how it is built, and refine on the fly until he gets something sensible.
It won't be as fast as "build this", but the result will be much better and more maintainable.
For the engineering workflow, you don't need Fable. Any model better than or equivalent to Sonnet 4.6 would do. Yes, sometimes it will hallucinate, sometimes it'll be wrong, but it's our job as engineers to correct it and have full ownership of the result.
What you said above is only true when the AI is not as smart/professional/knowledgeable as that engineer.
Of course it's not. Otherwise, before telling it "do this app, make no mistakes" you would need to feed an AI with the complete relevant knowledge, history, and constraints, and then your prompt wouldn't be a one-liner, but a 3000-page document.
And yet, even the smartest AI in the world would give an alternative solution every time you invoke it. And you still need someone to judge what is right and what is not.
Single-prompt performance is interesting because the best agentic results of yesterday turned out to be the best single-prompt results of today.
If we stopped developing LLMs, the only reasonable way to benchmark them would be to compare their performance with all the tricks we can build on top of them. Since they are still developing rapidly, any apples-to-apples comparison is worthwhile.
Of course this particular benchmark is not really single prompt but rather "agentic without steering".
I think that’s the point of the Superpowers SKILL
The thing with one-shot prompting is that it tests the ability for the model to make good choices on its own, rather than only instruction following.
Instruction following has been down pat for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as written if the instructions are precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).
For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.
That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.
For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.
That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron as that is.
Yet this is how virtually everybody is benchmarking and fine tuning.
Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.
It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.
I can't help but blame it on benchmarks and fine-tuning around prompt-to-solution work.