Comment by trq_
13 hours ago
Yes, we do, but harnesses are hard to eval: people use them across a huge variety of tasks, and sometimes different behaviors trade off against each other. We have added some evals to catch this one in particular.
Can't you keep the model the same, until the user chooses to use a different model?
He said it was the harness, not the model though.
Thank you. Fair enough.