Comment by SlinkyOnStairs
9 hours ago
> hopefully changes the way benchmarking is done
The purpose of a system is what it does.
AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
I work at OpenAI and I really don't find this to be the case.
We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
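To make the contamination rule concrete, here is a rough sketch of the scoring step as described above: an output is zeroed only when an automated detector flags it and human review agrees. The `Result` fields and function names are hypothetical, made up for illustration. This is not any lab's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Result:
    task_id: str
    raw_score: float
    flagged_by_detector: bool  # hypothetical: a model flagged possible contamination
    confirmed_by_human: bool   # hypothetical: human review agreed with the flag


def adjusted_score(r: Result) -> float:
    """Score an output as 0 only when both the detector and a human agree."""
    if r.flagged_by_detector and r.confirmed_by_human:
        return 0.0
    return r.raw_score


def benchmark_mean(results: list[Result]) -> float:
    """Mean score after applying the contamination rule."""
    return sum(adjusted_score(r) for r in results) / len(results)


results = [
    Result("a", 1.0, False, False),  # clean pass
    Result("b", 1.0, True, True),    # flagged and confirmed: zeroed
    Result("c", 1.0, True, False),   # flagged, but human review disagreed: kept
    Result("d", 0.0, False, False),  # clean fail
]

raw_mean = sum(r.raw_score for r in results) / len(results)
adjusted_mean = benchmark_mean(results)
print(f"raw mean: {raw_mean:.2f}, adjusted mean: {adjusted_mean:.2f}")
```

The adjusted number can only go down relative to the raw one, which is the point: skipping this step would be the easy way to report a higher score.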
There are a ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.
I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions is that cheating or is that patching the eval to better align it with user value?
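For anyone wondering why multiple model testing bias matters, here is a toy simulation. Every checkpoint has the same true pass rate, yet reporting the best of many runs still inflates the headline number. All parameters (100 checkpoints, 200 tasks, 50% true pass rate) are assumptions picked for illustration, not figures from any real eval.

```python
import random


def simulate_checkpoint(rng: random.Random, true_rate: float, n_tasks: int) -> float:
    """Measured pass rate of one checkpoint on n_tasks independent tasks."""
    passes = sum(rng.random() < true_rate for _ in range(n_tasks))
    return passes / n_tasks


def best_of_n(seed: int = 0, n_checkpoints: int = 100,
              true_rate: float = 0.5, n_tasks: int = 200) -> tuple[float, float]:
    """Return (best measured score, mean measured score) across checkpoints."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    scores = [simulate_checkpoint(rng, true_rate, n_tasks)
              for _ in range(n_checkpoints)]
    return max(scores), sum(scores) / len(scores)


best, mean = best_of_n()
print(f"true rate: 0.50, mean measured: {mean:.3f}, best of 100: {best:.3f}")
```

The mean lands near the true rate, while the best-of-100 score sits noticeably above it even though no checkpoint is actually better. Real checkpoints are correlated and do differ in quality, which is part of why the bias is hard to adjust for precisely.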
> The purpose of a system is what it does.
I am so tired of this saying.
It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.
Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.
I think the point is that if the side effects become known and are accepted, or if they are known and rejected, then indeed the purpose of the system is what it does.
https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...
You are misunderstanding the saying. It is entirely about unintended consequences and viewing the system for what it actually does and not any stated intentions of the designers.
I will propose that you are wrong.
1. We must ignore the intentions of the designers (your claim), and instead see what the outcomes are
2. Therefore we should ignore Beer's intentions when coining the phrase POSWID, and instead see how it is used.
3. The overwhelming majority of people using it on the internet (including the GP comment) do so to imply that the people perpetuating the system actually desire the outcome.
So the purpose of POSWID is clearly to imply intent.
Well that’s stupid and completely ignores the meaning of the word “purpose”.
Same. Anyone who has designed anything at all in any domain realizes that what your intentions are and what materializes are often not the same. You have practical constraints in the real world. That doesn’t somehow make the constraints the purpose. The saying makes no sense.
That is Anthropic’s shtick to a tee.