Comment by lwansbrough
3 hours ago
Had a couple thoughts in this realm, and am working them into my own harness. Curious to see what others think. I'm not sure if this is generalizable, as my harness is fairly specialized:
- Breaking down a problem into a planned execution, with executing agent providing the initial plan which includes explicit objectives such as which tools it calls and what it would consider to be a successful execution.
- The harness then executes the plan in order
- Each step that involves a tool call will be executed by breaking down the tool call into component parts: the harness interrogates the agent for a valid parameter value for the current tool argument. The tool definition contains validators for each argument. If the validator fails, the harness rewinds the conversation and injects the failure reason into the next try.
- Once the agent produces a valid response for the argument, the harness proceeds to the next argument.
- Once all the arguments have been filled, the harness calls the tool. It passes the agent's initial expected value along with the actual value, along with any errors that may have been produced and asks the agent if it is satisfied with the result. If it isn't, the agent provides a reason and the harness then retries the tool call process from the beginning rewinding the conversation and inserting the reasoning for the retry.
- The agent may request to re-plan if it discovers a flaw in its initial plan. The harness will also attempt to re-plan if the agent produces too many failures in a row.
This proves to be quite effective at reducing tool call failures. One benefit is that the sub-agent gets a perfect conversation history where it makes no mistakes. I'm not sure if it's actually better at completing tasks though, I haven't tried to benchmark it.
I went through a similar (in philosophy) exercise with my small-model agentic coding harness - built on forge.
A few things I noticed related to your points: - on conversation rewind, I implemented a similar tool call collapse on the main agent (the one you chat with). Once it was done with a task, the tool call history was collapsed to keep the context clean - it was more about hygiene than size.
- the harness interrogating the model bit is a bit different, I haven't tried that approach. Forge relies on model self-correction in a bid to avoid having bespoke error modes, but I guess if you can abstract and automate the interrogation based on schema or something that could work!
Overall I like the clean conversation history aspect, but I suspect that you might be doing a lot of round trips for tools with many args, versus "letting it fail and giving it one nudge". That being said, it's an interesting idea for harder scenarios/tasks!
I've been writing my own, out of curiosity, with gemma4. I've been surprised how far I'm getting.
Very cool! Hopefully you'll share it someday!