Comment by yogthos
7 hours ago
I haven't really seen anybody come up with a good test that gives hard numbers for comparing agentic harnesses. It's a bit tricky to set up a definitive test given the nondeterministic nature of LLMs. What I've been focusing on is watching the loop and seeing where the model does things that it shouldn't have to. For example, I notice models writing Python scripts to match parens for Clojure all the time when using editors like Pi. So, having a mechanical way to repair parens and, when that fails, giving the model a clear error about where the syntax is broken removes that whole cycle.
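To make the paren repair idea concrete, here is a rough sketch of what such a mechanical pass could look like. It is not the harness's actual implementation: the names `repair_parens` and `_pos` are made up for illustration, and the scanner is a simplified reader that only knows about strings, `;` comments, and character literals. It appends missing closers when that is unambiguous, and otherwise returns an error pointing at the line and column.

```python
from typing import Optional, Tuple

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def _pos(src: str, idx: int) -> Tuple[int, int]:
    """Convert a character index into a 1-based (line, column) pair."""
    line = src.count("\n", 0, idx) + 1
    col = idx - (src.rfind("\n", 0, idx) + 1) + 1
    return line, col


def repair_parens(src: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: return (repaired_source, None) or (None, error)."""
    stack = []  # (opener, index) for every delimiter not yet closed
    tail_comment = False  # True if the source ends inside a ';' comment
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if ch == ";":
            # Line comment: jump to the newline, or to the end of the input.
            nl = src.find("\n", i)
            tail_comment = nl == -1
            i = n if nl == -1 else nl
            continue
        if ch == "\\":
            i += 2  # character literal such as \( : skip the escaped char
            continue
        if ch == '"':
            # String literal: skip to the closing quote, honoring escapes.
            j = i + 1
            while j < n and src[j] != '"':
                j += 2 if src[j] == "\\" else 1
            if j >= n:
                line, col = _pos(src, i)
                return None, f"unterminated string starting at line {line}, column {col}"
            i = j + 1
            continue
        if ch in PAIRS:
            stack.append((ch, i))
        elif ch in CLOSERS:
            line, col = _pos(src, i)
            if not stack:
                return None, f"unexpected '{ch}' at line {line}, column {col}"
            opener, at = stack.pop()
            if PAIRS[opener] != ch:
                oline, ocol = _pos(src, at)
                return None, (
                    f"'{ch}' at line {line}, column {col} does not close "
                    f"'{opener}' opened at line {oline}, column {ocol}"
                )
        i += 1
    # Only missing closers remain: append them, innermost first.
    missing = "".join(PAIRS[opener] for opener, _ in reversed(stack))
    sep = "\n" if missing and tail_comment else ""
    return src + sep + missing, None
```

Calling `repair_parens("(defn f [x] (+ x 1")` gives back the source with `))` appended, while `repair_parens("(foo [1 2)")` returns an error naming both the stray `)` and the `[` it failed to close. The point is that the model gets either a fixed form or a precise location, instead of burning turns on ad hoc scripts.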
As it stands, it's kind of subjective: you just have to try the harness and see if the model seems to behave better than with the other ones you've been using.
How are you iterating on a system prompt and tool descriptions without an eval that gives you hard numbers for improvement or regression?
I look at what the model is doing in the loop and whether the harness is catching cases such as the model having to write scripts to balance parens or trying to do the same thing over and over again, along with all the other cases I explained in detail in the blog post.
Even without having hard numbers, it's pretty easy to see from the log whether the model is getting stuck or not.
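As a rough illustration of spotting "stuck" behavior in a log, the sketch below flags runs of identical consecutive tool calls. The log shape (a list of dicts with `tool` and `args` keys) and the name `find_repeats` are assumptions for the example, not the format of any particular harness.

```python
import json
from itertools import groupby
from typing import Any, Dict, List, Tuple


def find_repeats(
    calls: List[Dict[str, Any]], threshold: int = 3
) -> List[Tuple[int, int, Dict[str, Any]]]:
    """Return (start_index, run_length, call) for each run of identical
    consecutive calls that is at least `threshold` long."""
    # Serialize with sorted keys so equal calls compare equal as strings.
    keyed = [json.dumps(call, sort_keys=True) for call in calls]
    runs = []
    idx = 0
    for key, group in groupby(keyed):
        length = len(list(group))
        if length >= threshold:
            runs.append((idx, length, json.loads(key)))
        idx += length
    return runs


# Hypothetical log: the model retries the same edit three times in a row.
log = [
    {"tool": "read", "args": {"path": "core.clj"}},
    {"tool": "edit", "args": {"path": "core.clj", "text": "(foo"}},
    {"tool": "edit", "args": {"path": "core.clj", "text": "(foo"}},
    {"tool": "edit", "args": {"path": "core.clj", "text": "(foo"}},
    {"tool": "run", "args": {"cmd": "test"}},
]
stuck = find_repeats(log)
```

This only catches exact repeats, so it is a crude signal rather than a metric, but it is the kind of pattern that also jumps out when reading the log by eye.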