Comment by gcr

21 days ago

how do you know your harness design isn’t just overfitting on your test set?

1 comment

gcr

Love this question! A few points:

- First, there's totally a "risk" there. I built both the harnesses and the eval suite and that's hardly a double-blind study. There's no world where some bias doesn't leak through.

- I did try to design the guardrails to be domain-agnostic, so they return generic nudges to the LLM rather than being tuned to specific scenario failures.

- Most tactically, the guardrails were built on the first 18 scenarios (OG-18) published in the paper, and only afterward did I add 8 more advanced reasoning ones. I didn't update the guardrails when I added those, and the lift was still there. If they were overtuned, they wouldn't have the same level of impact on a newer set.

- I did dogfood forge post-publication using several unrelated consumers, and the features I baked in were rarely guardrail-related. When they were, they were more model-focused (e.g., xml-parse-rescue for granite models).
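
The held-out check in the third bullet can be sketched roughly as below. This is a minimal illustration, assuming per-scenario pass/fail results are available as lists of booleans. The function names and the sample outcomes are hypothetical placeholders, not forge's API and not real results.

```python
def pass_rate(results):
    """Fraction of scenarios that passed; `results` is a list of booleans."""
    return sum(results) / len(results)


def lift(baseline, guarded):
    """Pass-rate improvement from enabling guardrails on one scenario set."""
    return pass_rate(guarded) - pass_rate(baseline)


def lift_gap(dev_baseline, dev_guarded, held_baseline, held_guarded):
    """Dev-set lift minus held-out lift.

    A large positive gap hints that the guardrails are tuned to the
    scenarios they were developed against.
    """
    return lift(dev_baseline, dev_guarded) - lift(held_baseline, held_guarded)


# Made-up placeholder outcomes (NOT real forge results): 18 dev scenarios
# the guardrails were built on, and 8 scenarios added afterward.
dev_baseline = [True] * 9 + [False] * 9
dev_guarded = [True] * 15 + [False] * 3
held_baseline = [True] * 4 + [False] * 4
held_guarded = [True] * 6 + [False] * 2

print(f"dev lift:      {lift(dev_baseline, dev_guarded):.3f}")
print(f"held-out lift: {lift(held_baseline, held_guarded):.3f}")
print(f"gap:           {lift_gap(dev_baseline, dev_guarded, held_baseline, held_guarded):.3f}")
```

With only 8 held-out scenarios the estimate is noisy, so a small gap is weak evidence either way. It also does not remove the shared-author bias noted above, since the same person wrote both scenario sets.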

But at the end of the day, there's an explicit connection between the guardrail author and eval author. Happy to take contributions of eval scenarios if you want to stress test things, or hear about your experience running a completely different consumer!