Comment by quantumleaper

7 hours ago

How are you iterating on a system prompt and tool descriptions without an eval that gives you hard numbers for improvement or regression?

I watch what the model does in the loop and whether the harness catches the failure cases: the model having to write scripts to balance parens, the model retrying the same thing over and over, and all the other cases I explained in detail in the blog post.

Even without hard numbers, it's pretty easy to tell from the log whether the model is getting stuck.
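
As a rough sketch of what a "stuck" check over a log could look like: this assumes the harness log can be read as a list of tool-call records with `tool` and `args` fields. That format and the function names below are hypothetical, made up for illustration, and not what the blog post's harness actually uses.

```python
import json
from collections import Counter


def call_key(event):
    """Normalize one tool call so identical calls compare equal.

    Assumes each event is a dict with "tool" and "args" keys (hypothetical format).
    """
    return (event["tool"], json.dumps(event["args"], sort_keys=True))


def find_repeated_calls(events, threshold=3):
    """Return (tool, args_json, count) for calls seen at least `threshold` times."""
    counts = Counter(call_key(e) for e in events)
    return [(tool, args, n) for (tool, args), n in counts.items() if n >= threshold]


def longest_identical_run(events):
    """Return the length of the longest streak of consecutive identical calls."""
    best = 0
    run = 0
    prev = None
    for e in events:
        key = call_key(e)
        # Extend the streak if this call matches the previous one, else restart it.
        run = run + 1 if key == prev else 1
        best = max(best, run)
        prev = key
    return best


# Illustrative log: the same edit attempted three times in a row, then a test run.
sample_events = [
    {"tool": "edit", "args": {"path": "a.txt"}},
    {"tool": "edit", "args": {"path": "a.txt"}},
    {"tool": "edit", "args": {"path": "a.txt"}},
    {"tool": "run_tests", "args": {}},
]

print(find_repeated_calls(sample_events))
print(longest_identical_run(sample_events))
```

This only catches exact repeats. A model that retries with slightly different arguments slips past it, so it complements reading the log and doesn't replace it.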