Comment by cookiengineer
4 days ago
I would disagree here.
Building a good and working coding harness with smaller models is really hard. Everything revolves around the limited context size.
Tools must be specification-driven to reduce noise and high-temperature hallucinations. Tool call shrinking needs to remove errors and tryouts of different parameter formats (because LLMs always ignore descriptions in the JSON...). You have to deal with long-running agents because you can't afford them. With a planner/orchestrator architecture, agent-to-agent communication needs to be summarized. And then you have the messed-up scheduling parts, because you need to prioritize short-running agents and give the planner a tool to wait for the outputs of spawned contractor agents.
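To make "tool call shrinking" concrete, here is a minimal sketch of one way it could work. It is an assumption about the idea, not how exocomp does it: the `ToolCall` record and `shrink_tool_calls` function are hypothetical names. Failed attempts are dropped from the history once a later call to the same tool succeeded, so the retries with wrong parameter formats stop eating context.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation kept in the agent history.
    tool: str
    args: dict
    ok: bool
    output: str


def shrink_tool_calls(history: list[ToolCall]) -> list[ToolCall]:
    """Drop failed calls that a later successful call to the same tool supersedes."""
    kept: list[ToolCall] = []
    succeeded_later: set[str] = set()
    # Walk backwards so we know, for each call, whether a success follows it.
    for call in reversed(history):
        if call.ok:
            succeeded_later.add(call.tool)
            kept.append(call)
        elif call.tool not in succeeded_later:
            # Unresolved failure: the model still needs to see this error.
            kept.append(call)
    kept.reverse()
    return kept


history = [
    ToolCall("read_file", {"file": "a.txt"}, False, "unknown parameter 'file'"),
    ToolCall("read_file", {"path": ["a.txt"]}, False, "'path' must be a string"),
    ToolCall("read_file", {"path": "a.txt"}, True, "hello"),
    ToolCall("grep", {"pattern": "("}, False, "invalid regex"),
]
shrunk = shrink_tool_calls(history)
```

Here the two failed `read_file` tryouts disappear, while the failed `grep` stays because nothing has resolved it yet. Matching on tool name alone is a simplification; a real harness would probably also want to match on intent or target.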
And that's not even talking about sandbox vs playground read/write/access policies of tools.
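As a rough illustration of what such a policy check might look like, here is a small sketch. The meaning of the two zones is my assumption, not something stated above: "sandbox" is taken to be a scratch area where tools may read and write, "playground" a working tree that tools may only read. The `POLICIES` table, `ZONE_ROOTS` paths, and `is_allowed` function are all hypothetical.

```python
import posixpath
from pathlib import PurePosixPath

# Assumed policy: which actions each zone permits.
POLICIES = {
    "sandbox": {"read", "write"},
    "playground": {"read"},
}

# Assumed example roots; a real harness would configure these.
ZONE_ROOTS = {
    "sandbox": "/work/sandbox",
    "playground": "/work/repo",
}


def is_allowed(path: str, action: str) -> bool:
    """Return True if a tool may perform `action` on `path`; deny by default."""
    # Normalize first so ".." segments cannot escape a zone root.
    normalized = PurePosixPath(posixpath.normpath(path))
    for zone, root in ZONE_ROOTS.items():
        if normalized.is_relative_to(root):
            return action in POLICIES[zone]
    return False  # outside every known zone
```

This only normalizes the path lexically. It does not resolve symlinks, so it is a sketch of the policy lookup, not a security boundary.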
Harness engineering, if done correctly, is quite hard.
And all of this works 60% of the time, every time.
Anyways, that was somewhat the summary of the last 6 months building my exocomp agentic environment. And it's still not satisfying to work with.
In my limited experience, the smaller the model, the bigger the harness. Where with something like Claude or DeepSeek the context size etc. just lets you give it bash access and step back, small models tend to do better with simple action-response, new context each call. Context management becomes a continuous activity. It's a fun space, and I have found big models decent at building and improving these harnesses for the small ones, using /loop to just run a continuous test-build-test loop.
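For anyone wondering what "simple action-response, new context each call" could look like, here is a minimal sketch under my own assumptions. The model is any callable that takes a prompt and returns an `(action, argument)` pair, and `build_prompt`, `run_steps`, and the three-note limit are hypothetical choices. Nothing accumulates between calls except a short list of notes.

```python
from typing import Callable, Optional

Model = Callable[[str], tuple[str, str]]  # prompt -> (action, argument)
Tool = Callable[[str], str]


def build_prompt(task: str, notes: list[str]) -> str:
    """Build a fresh prompt from the task and a few recent result notes."""
    lines = [f"Task: {task}", "Recent results:"]
    lines += notes or ["(none)"]
    lines.append("Reply with one action.")
    return "\n".join(lines)


def run_steps(task: str, model: Model, tools: dict[str, Tool],
              max_steps: int = 5, keep_notes: int = 3) -> tuple[Optional[str], list[str]]:
    notes: list[str] = []  # the only state carried between calls
    for _ in range(max_steps):
        # Every call starts from a new, small context instead of a growing chat log.
        action, arg = model(build_prompt(task, notes))
        if action == "done":
            return arg, notes
        result = tools[action](arg)
        notes.append(f"{action}({arg}) -> {result[:80]}")  # truncate long outputs
        notes = notes[-keep_notes:]  # continuous context management
    return None, notes  # step budget exhausted


# Stand-in for a small model: echo once, then finish.
def fake_model(prompt: str) -> tuple[str, str]:
    return ("done", "finished") if "echo(" in prompt else ("echo", "hi")


answer, final_notes = run_steps("say hi", fake_model, {"echo": str.upper})
```

The trade-off is that anything not captured in the notes is gone on the next call, which is exactly why the summarizing and pruning logic ends up being most of the harness.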