Comment by ijk
14 hours ago
> 2. Remembering they are supposed to write tests and keep ALL of them green (just like our human juniors...)
I think the core principle that everyone is forgetting is that your evaluation metric must be kept separate from your optimization metric.
In most setups I've seen, there isn't much emphasis on adding scripting that's external to the LLM, but in my experience, having that verification outside of the LLM loop is critical to keep it from cheating. It won't intend to cheat, insofar as it has any intent at all, but you're giving it a boatload of optimization functions to balance, and it's prone to randomly dropping one at the worst time. And to be fair, falling flat on its face to win the race [1] is often the implicit conclusion of what we told it to do without realizing the consequences.
If you need something to happen every time, particularly as part of validation, it's better to make an automated script part of the process than to pile on one more instruction.
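For concreteness, here's a minimal sketch of the kind of external gate I mean (hypothetical setup: it assumes a pytest-based project and that this script runs as a CI step or pre-commit hook, outside anything the model controls), so the pass/fail signal comes from the real test runner's exit code rather than from the model's own report:

```python
#!/usr/bin/env python3
"""External verification gate: run the test suite outside the LLM loop.

A minimal sketch, assuming a pytest-based project and a hypothetical
workflow where the agent's changes are staged before this runs
(e.g. as a pre-commit hook or CI step).
"""
import subprocess
import sys


def run_tests() -> bool:
    # The LLM never sees or controls this step; the exit code of the
    # real test runner is the only signal that counts.
    result = subprocess.run(
        ["pytest", "--maxfail=1", "-q"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    if not run_tests():
        # Reject the change regardless of what the agent claims.
        print("Verification failed: tests are not green. Rejecting change.")
        sys.exit(1)
    print("Verification passed: all tests green.")
```

Wired into CI or a hook, the agent can iterate however it likes, but nothing lands unless this script exits zero.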