Comment by NichoPaolucci

11 hours ago

If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention, we can pretty easily judge how well the model does overall. And judging better models just means adding more requirements to a task. So, I think it's a useful method (even if it's not a realistic use case).

Of course, with a software engineer at the helm, the models can be guided to produce much better output. (Or worse, depending on the engineer!)

You seem to be missing the point of what parent is saying :)

To really evaluate what a model is like to use in real life, it should have access to tools and be able to iterate on something, the way models do when you run them in an agent harness.

None of that iteration necessarily needs a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture). You can just let the model make a couple of attempts and give it feedback on how it's doing, and you get something closer to how people use these models in reality.
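To make that concrete, here is a rough sketch of what "a couple of attempts plus feedback" could look like in an eval harness. Everything in it is made up for illustration: `generate` stands in for whatever calls the model, `check` for whatever automated checker you have (tests, linters, a rubric), and `fake_model` is a toy stub rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalResult:
    passed: bool
    attempts: int
    unmet: List[str]


def evaluate_with_retries(
    generate: Callable[[str, Optional[str]], str],  # hypothetical model call
    check: Callable[[str], List[str]],              # returns unmet requirements
    task: str,
    max_attempts: int = 3,
) -> EvalResult:
    """Let the model retry, feeding back what the checker found. No human involved."""
    feedback: Optional[str] = None
    unmet: List[str] = []
    for attempt in range(1, max_attempts + 1):
        output = generate(task, feedback)
        unmet = check(output)
        if not unmet:
            return EvalResult(True, attempt, [])
        # Automated feedback for the next attempt.
        feedback = "Unmet requirements: " + "; ".join(unmet)
    return EvalResult(False, max_attempts, unmet)


# Toy stand-ins (assumptions, not a real model or checker).
def fake_model(task: str, feedback: Optional[str]) -> str:
    return "code" if feedback is None else "code+tests"


def fake_check(output: str) -> List[str]:
    return [] if "tests" in output else ["missing tests"]


result = evaluate_with_retries(fake_model, fake_check, "write a parser")
```

Scoring on `attempts` as well as `passed` gives you the one-shot number (attempts == 1) and the with-iteration number from the same run.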

> If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention (...)

This is the wrong metric to target. Today's models can feel one-shot, but they get there at the cost of resilient ReAct loops that brute-force their way out of the mess the initial prompt created.

And each iteration is expensive.

Sometimes failing fast and early is better than going for one-shot models that try to mitigate the mess they created with reasoning steps and ReAct loops.
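As a rough sketch of what "failing fast" might mean in a harness: cap the spend and bail out once iterations stop making progress, instead of letting the loop grind. The names here (`step`, `max_cost`, `patience`) are hypothetical, and `step` is assumed to run one iteration and report its cost plus the number of issues still open.

```python
from typing import Callable, Tuple


def run_with_budget(
    step: Callable[[], Tuple[float, int]],  # hypothetical: returns (cost, open issues)
    max_cost: float,
    patience: int = 2,
) -> Tuple[str, float]:
    """Stop when solved, when progress stalls, or when the budget is spent."""
    spent = 0.0
    best = None   # fewest open issues seen so far
    stalled = 0   # consecutive iterations without improvement
    while True:
        cost, remaining = step()
        spent += cost
        if remaining == 0:
            return ("solved", spent)
        if best is None or remaining < best:
            best, stalled = remaining, 0
        else:
            stalled += 1
        if stalled >= patience:
            return ("stalled", spent)      # fail fast: the loop is not converging
        if spent >= max_cost:
            return ("over_budget", spent)  # each iteration is expensive, so cap it
```

The point of tracking `stalled` separately from `max_cost` is that a loop can be cheap per step and still be going nowhere.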