Comment by impulser_
3 days ago
Because there is literally nothing special about coding hardnesses. The models are doing all the lifting. It's just user experience that separates them.
A coding hardness with just bash outperforms Codex, Claude Code, OpenCode, Pi, etc. The added features are just user experience features.
If harnesses are basically doing nothing, why would these metrics vary so widely?
https://www.endorlabs.com/research/ai-code-security-benchmar...
There are a lot of ways to configure agents, and any implicit configuration in a harness may have a non-trivial effect.
It's because they do things that is why they score differently. Coding hardness add features for user experience not for agent efficiency. If they did, all the coding hardnesses would be using bash and code mode and letting the agents write code to perform tasks, but this doesn't work because you want humans in the loop. You want users to be able to approve and deny writes. You want users to see edits. So you have to build tools for these. It's hard to show diffs when the agent is just using bash.
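To make the approve/deny point concrete, here is a minimal Python sketch of a write gate that shows a diff before anything is applied. It is not how any particular harness does it; `propose_write` and the `approve` callback are hypothetical names, and only the standard library `difflib` is assumed.

```python
import difflib
from typing import Callable


def propose_write(path: str, old: str, new: str,
                  approve: Callable[[str], bool]) -> tuple[bool, str]:
    """Build a unified diff for a proposed edit and ask a human to approve it.

    Returns (approved, diff_text). Nothing is written to disk here;
    the caller applies the edit only if approved is True.
    """
    diff_lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    diff_text = "".join(diff_lines)
    if not diff_text:
        return False, ""  # no change, so there is nothing to approve
    # `approve` stands in for whatever UI the harness uses (prompt, button, ...).
    return approve(diff_text), diff_text
```

A structured edit tool gives the harness both the old and the new text, so it can render this diff. With raw bash (say, a `sed -i` call), the harness only sees a command string, which is roughly why diffs are hard to show there.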
> The added features are just user experience features.
> It's because they do things that is why they score differently.
That was my point. Regardless of how you feel about UX, it's a value-added set of features. The question initially posed stands: why would a company do any of these things?
> Coding hardness add features for user experience not for agent efficiency.
Pretending it was always about some metric you just decided was important is moving the goalposts. It's not compelling.
I think it makes more sense that it's Freemium Dominance or they act as Low-Cost Marketing tools.
A harness (notice the lack of a 'd') is a strap system to gain control over something.
Like the thing people attach a dog lead to so that their kids won't just go kamikaze into a car.
Coding harnesses are named by analogy to that.
They are not hard.
The reason I have a dog harness is to distribute weight so I don't choke her when she goes at the other dog that she doesn't like. I'm actually puzzling over kids kamikazeing into cars.
It's actually only a problem if it's the other way around, isn't it?
If kids run into a car, they will most probably just bounce and continue, perhaps inflicting some minor damage. But if a car mows down a kid, that could well be a fatal injury. Leashes for all the cars! ;)
It is a common fear for parents. Obviously they are not fighting for the emperor but chasing or running away from something.
The strapped kids are often normal with no apparent disabilities (but it is possible they have an ADHD diagnosis).
Never thought about doing it to my own.
You got to miss spell these days or people assume your ai :)
That's very punnyy
Its like yuo're on fire!
Try Kimi in Kimi CLI and in Claude Code, then try saying that again. Without the measures that their CLI has and Claude Code lacks, Kimi quickly collapses into tool-calling loops, and it is largely useless for long-running tasks in any harness that doesn't take this into account.
With those measures (which are actually quite interesting) it can at times perform at Sonnet level.
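I don't know what Kimi CLI actually does here, so the following is only a generic sketch of one way a harness could catch a tool-calling loop: flag a call when the identical tool and arguments show up too often in a recent window. `LoopGuard`, `window`, and `max_repeats` are made-up names and thresholds.

```python
import json
from collections import deque


class LoopGuard:
    """Flags a tool call when the same call repeats too often in a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        # Only the last `window` calls are remembered (hypothetical default).
        self.recent: deque[str] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def check(self, tool: str, args: dict) -> bool:
        """Record the call; return True if it looks like a loop."""
        # Canonical key so that argument order does not matter.
        key = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats
```

When `check` returns True, the harness could refuse the call, inject a hint into the context, or stop the run; which of those works best is presumably model-specific.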
I would disagree here.
Building a good and working coding harness with smaller models is really hard. Everything revolves around the limited context size.
Tools must be specification-driven to reduce noise and high-temperature hallucinations, tool call shrinking needs to remove errors and tryouts of different parameter formats (because LLMs always ignore descriptions in the JSON...), and you have to deal with long-running agents because you can't afford them. You need a planner/orchestrator architecture, agent-to-agent communication needs to be summarized, and then you have the messed-up scheduling parts, because you need to prioritize short-running agents and give the planner a tool to wait for outputs of spawned contractor agents.
And that's not even talking about sandbox vs playground read/write/access policies of tools.
Harness engineering, if done correctly, is quite hard.
And all of this works 60% of the time, every time.
Anyways, that was somewhat the summary of the last 6 months building my exocomp agentic environment. And it's still not satisfying to work with.
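One possible reading of the "tool call shrinking" mentioned above, sketched in Python: once a tool has been called successfully, the earlier failed attempts at that tool (wrong parameter formats and so on) are dropped from the history that goes back into the context. This is an interpretation, not the commenter's implementation; `ToolCall` and `shrink_history` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict
    ok: bool      # whether the call succeeded
    output: str


def shrink_history(history: list[ToolCall]) -> list[ToolCall]:
    """Drop failed calls that are later followed by a successful call to the same tool."""
    kept: list[ToolCall] = []
    succeeded_later: set[str] = set()
    # Walk backwards so that, for each call, we already know whether
    # the same tool succeeds at some later point.
    for call in reversed(history):
        if call.ok:
            succeeded_later.add(call.tool)
            kept.append(call)
        elif call.tool not in succeeded_later:
            kept.append(call)  # unresolved failure: keep it visible to the model
    kept.reverse()
    return kept
```

Matching on tool name alone is crude. A real harness would probably want to match on the intended target as well, so that a failed edit to one file is not hidden by a successful edit to another.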
In my limited experience, the smaller the model, the bigger the harness. Whereas with something like Claude or DeepSeek the context size etc. just lets you give it bash access and step back, small models tend to do better with simple action-response and a new context each call. Context management becomes a continuous activity. It's a fun space, and I have found big models decent at building and improving these harnesses for the small ones, using /loop and just running a continuous test-build-test loop.
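A rough Python sketch of that "simple action-response, new context each call" pattern, under stated assumptions: `model` and `execute` are hypothetical callables standing in for the LLM call and the tool runner, and the literal string "DONE" is an invented stop signal. Each step builds a fresh prompt from the task plus a bounded state string instead of an ever-growing transcript.

```python
from typing import Callable


def run_steps(task: str,
              model: Callable[[str], str],
              execute: Callable[[str], str],
              max_steps: int = 5,
              state_chars: int = 200) -> str:
    """Drive a model one action at a time, rebuilding the prompt every step."""
    state = ""
    for _ in range(max_steps):
        # Fresh context: only the task and a bounded summary of past steps.
        prompt = f"Task: {task}\nState so far: {state}\nNext action:"
        action = model(prompt)
        if action == "DONE":
            break
        result = execute(action)
        # Keep only the tail so the prompt size stays bounded.
        state = (state + f" | {action} -> {result}")[-state_chars:]
    return state
```

Truncating to the last `state_chars` characters is the crudest form of context management; replacing it with a summarization step is the obvious next move.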
Your reply doesn't answer the question: What is their motivation for any of it?