Comment by mg

12 hours ago

Considerations about what goes on in agents internally will probably not be part of software development for long.

Personally, I already see LLMs and agents as black boxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I run "git reset --hard", change my prompts and restart the feature request.

To have an ongoing sense of which agents perform best, I keep a log and calculate an Elo score of which agents meet my demands best. This score is important to me, not so much how the agent achieves it.
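For readers who want to try the same bookkeeping, here is a minimal sketch of an Elo update over a log of pairwise comparisons. It is not mg's actual script: the log format, the agent names, the starting rating of 1500 and the K-factor of 32 are all assumptions chosen for illustration.

```python
# Minimal Elo bookkeeping sketch. Agent names, the log format, the
# starting rating (1500) and the K-factor (32) are illustrative
# assumptions, not details taken from the comment above.

START_RATING = 1500.0
K_FACTOR = 32.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Update both ratings in place after `winner` beat `loser`."""
    r_win = ratings.get(winner, START_RATING)
    r_lose = ratings.get(loser, START_RATING)
    delta = K_FACTOR * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta


# Hypothetical log: one (winner, loser) pair per compared feature request.
comparison_log = [
    ("agent_a", "agent_b"),
    ("agent_a", "agent_c"),
    ("agent_b", "agent_c"),
]

ratings: dict = {}
for winner, loser in comparison_log:
    update_elo(ratings, winner, loser)

# Print agents from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Each update moves the same number of points from the loser to the winner, so the ratings only carry meaning relative to each other.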

15 comments

hypfer 12 hours ago

This is an absolutely crazy, wasteful thing to do, considering the actual cost of all that inference, and nothing to be proud of.

loehnsberg 11 hours ago

Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but the post says that an Elo score is kept, so this is also an experiment, and I don’t see what’s wrong with that. You learn which model performs well in which settings, which may save resources later. It’s also wasteful to keep working with the wrong model/harness/tools for too long.
mg 11 hours ago
It is the other way round.
In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used, because the model now gets not only the original code and the feature request but also the updated code plus the change request as input tokens.
Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
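The arithmetic behind this claim can be sketched as a raw token count. All numbers below are made up, and the sketch assumes no prompt caching, equal weight for input and output tokens, and that the model returns the full updated code each time.

```python
# Raw token counts for the two workflows. Every number here is an
# illustrative assumption; no caching and no price weighting is modeled.

CODE = 10_000     # tokens of existing code sent as context
REQUEST = 200     # tokens in the feature request
CHANGE = 20       # tokens in "make the button red"
OUTPUT = 10_000   # tokens of updated code the model returns


def single_run() -> int:
    """One request: code + request in, updated code out."""
    return (CODE + REQUEST) + OUTPUT


def interactive_follow_up() -> int:
    """Second turn re-sends the whole history plus the change request."""
    first = (CODE + REQUEST) + OUTPUT
    second = (CODE + REQUEST + OUTPUT + CHANGE) + OUTPUT
    return first + second


def restart_with_new_prompt() -> int:
    """Reset, then send the request again with the change folded in."""
    first = (CODE + REQUEST) + OUTPUT
    second = (CODE + REQUEST + CHANGE) + OUTPUT
    return first + second


base = single_run()
print("interactive / single:", round(interactive_follow_up() / base, 2))
print("restart / single:    ", round(restart_with_new_prompt() / base, 2))
```

Under these assumptions the interactive follow-up costs roughly 2.5 times a single run in raw tokens, while the restart costs almost exactly 2 times.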
- jgilias 11 hours ago
  
  The cost is far from linear though. Because of prompt caching and the fact that generally output tokens are a lot more expensive than input tokens.
  
- ryan_glass 11 hours ago
  
  "Make the button red" probably doesn't need an LLM at all.
  
- Chirono 11 hours ago
  
  That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then the cost is purely incremental.
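To see how caching and output pricing change the picture, here is a rough cost sketch. The relative prices (cached input at one tenth of fresh input, output at five times fresh input) and all token counts are assumptions for illustration, not any provider's actual pricing. It also assumes the restart gets no cache hit at all and that both workflows regenerate the full code.

```python
from dataclasses import dataclass

# Relative prices in arbitrary integer units per token. These ratios are
# assumptions for illustration, not any provider's real price list.


@dataclass(frozen=True)
class Prices:
    fresh_input: int = 10
    cached_input: int = 1
    output: int = 50


def call_cost(fresh: int, cached: int, output: int, p: Prices) -> int:
    """Cost of one model call, split by token type."""
    return fresh * p.fresh_input + cached * p.cached_input + output * p.output


CODE, REQUEST, CHANGE, OUTPUT = 10_000, 200, 20, 10_000  # made-up sizes
prices = Prices()

first_call = call_cost(CODE + REQUEST, 0, OUTPUT, prices)

# Interactive follow-up: the earlier prompt is assumed to be a cache hit;
# the previous output and the change request are fresh input.
follow_up = call_cost(OUTPUT + CHANGE, CODE + REQUEST, OUTPUT, prices)
interactive_total = first_call + follow_up

# Restart: assumed to miss the cache entirely (pessimistic for restart).
restart_call = call_cost(CODE + REQUEST + CHANGE, 0, OUTPUT, prices)
restart_total = first_call + restart_call

print("interactive:", interactive_total)
print("restart:    ", restart_total)
print("ratio:      ", round(interactive_total / restart_total, 3))
```

With these assumed numbers the two workflows end up within about one percent of each other, because output tokens dominate the bill in both. If the follow-up only emitted a small diff instead of the full code, the interactive path would come out cheaper.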
redox99 11 hours ago
Probably like 1% of the energy an average person spends on driving.
- Raphael_Amiard 11 hours ago
  
  The average American, you mean.
cactusplant7374 6 hours ago

The cost is nothing compared to the outcome and time savings. What I see is that people with no money want to jump into this pool but they aren't having a good time. That is generally the case when you are poor.
cyanydeez 11 hours ago

come on now, we can't just not escape the permanent underclass by using our brains, we've also got to use up all the resources while doing it.

justinclift 8 hours ago

What kind of projects/code do you have them work on?

Asking because I'd guess that approach would be OK for the types of front-end work that don't require much security or other validation.

But it sounds like it wouldn't be suitable for work in regulated industries or anything that needs to have extreme care taken.

perching_aix 10 hours ago

Which model is leading the pack for you?

mg 10 hours ago

From the SOTA model providers, I only use OpenAI and Google. And between gpt-5.5 and gemini-3.1-pro-preview, gpt-5.5 is currently leading.