Comment by ofirpress
20 hours ago
Benchmarks can get costly to run. You can reach out to frontier model creators to try to get free credits, but usually they'll only agree to that once your benchmark is pretty popular.
so basically they know requests using your API key should be treated with care?
They could, but you can also have some trust in Anthropic to have some integrity there; these are earnest people.
"trust but verify" ofc . https://latent.space/p/artificialanalysis do api keys but also mystery shopper checks
That's why we're setting up adversarial benchmarks to test if they are doing the thing they promised not to do, because we totally trust them.
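A mystery-shopper check can be pretty simple in principle: send the same prompt through the known benchmark key and through a fresh, unlabeled key, and look for divergence. A minimal sketch of the idea (everything here is hypothetical: the endpoint, the key names, and the OpenAI-style response shape are assumptions, not any vendor's real API):

```python
# Sketch of a "mystery shopper" check: compare responses obtained via the
# recognizable benchmark API key against a fresh, anonymous key.
# BASE_URL, keys, and the response shape are all hypothetical.
import requests

BASE_URL = "https://api.example-llm-vendor.com/v1/chat/completions"
BENCHMARK_KEY = "sk-benchmark-..."   # the key the vendor can recognize
SHOPPER_KEY = "sk-fresh-anon-..."    # a fresh key with no benchmark history

def ask(api_key: str, prompt: str) -> str:
    resp = requests.post(
        BASE_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "some-model",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # deterministic-ish, so answers are comparable
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompt = "Return the SHA-256 of the string 'hello' as lowercase hex."
a = ask(BENCHMARK_KEY, prompt)
b = ask(SHOPPER_KEY, prompt)
if a != b:
    print("divergence between benchmark key and mystery-shopper key:")
    print("benchmark:", a[:200])
    print("shopper:  ", b[:200])
```

Single responses are noisy even at temperature 0, so in practice you'd compare distributions over many prompts and runs rather than one pair of strings.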
> these are earnest people.
I agree.
I'll also add that when my startup got acquired by a very large, well-known valley giant with a sterling reputation for integrity and I ended up as a senior executive, over time I got a first-hand education in the myriad ways genuinely well-intentioned people can still end up being the responsible parties presiding over a system doing net-wrong things. All with no individual ever meaning to, or even consciously knowing.
It's hard to explain, and I probably wouldn't have believed it myself before I saw and experienced it. Standing against an overwhelming organizational tide is stressful and never leads to popularity or promotion. I think I probably managed to move on before directly compromising myself, but preventing that required constant vigilance and led to some interpersonal and 'official' friction. And, frankly, I'm not really sure. It's entirely possible I bear direct moral responsibility for a few things I believe no good person would do as an exec at a good company.
That's the key take-away, which took me a while to process and internalize. In a genuinely good organization with genuinely good people, it's not "good people get pressured by constraints and tempted by extreme incentives, then eventually slip". It's subtler than that: there's no discrete moment of slipping, just an accumulation of locally reasonable decisions whose aggregate nobody chose. I still talk with friends who are senior execs there, and sometimes they want to talk about whether something is net good or bad. I kind of dread the conversation going there because it's inevitably incredibly complex and confusing. Philosophers' trolley-car ethics puzzles pale next to these multi-layered, messy conundrums. But who else are they going to vent to who might understand? To be clear, I still believe that company and its leadership to be among the most moral, ethical and well-intentioned in the valley. I was fortunate to experience the best-case scenario.
Bottom line: if you believe earnest, good people being in charge is a reliable defense against the organization doing systemically net-wrong things, you don't comprehend the totality of the threat environment. And that's okay. Honestly, you're lucky. Because the reality is infinitely more ambiguous and amoral than white hats vs. black hats; at the end of the day the best the 'very good people' can manage is some shade of middle gray. The saddest part is that good people still care, so they want to check the shade of their own hat, but no one can ever see whether it's light enough to at least tell themselves "I did good today."
The last thing a proper benchmark should do is reveal its own API key.
That's a good thought I hadn't had, actually.
IMO a third party should be running the LLM calls anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover the benchmarking that way.
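Even with a third party, a naive harness is easy to fingerprint. A rough sketch of the kind of surface-level disguising a harness might do; the paraphrase templates and jitter bounds here are made up, purely illustrative:

```python
# Sketch: reduce a benchmark run's fingerprint by perturbing surface
# features of each request (framing, order, timing) so a provider can't
# trivially match today's run against yesterday's. Illustrative only.
import random
import time

PREAMBLES = [
    "Please solve the following task.",
    "Here is a problem for you:",
    "Task:",
    "Answer the question below.",
]

def disguise(prompt: str, rng: random.Random) -> str:
    # Vary the framing without changing the underlying task.
    return f"{rng.choice(PREAMBLES)}\n\n{prompt}"

def run_suite(tasks: list[str], ask, seed=None) -> list[str]:
    rng = random.Random(seed)
    tasks = tasks[:]
    rng.shuffle(tasks)                     # don't always send in the same order
    results = []
    for task in tasks:
        time.sleep(rng.uniform(0.5, 5.0))  # jitter timing so runs aren't periodic
        results.append(ask(disguise(task, rng)))
    return results
```

The obvious trade-off: the more you perturb prompts to avoid detection, the less comparable runs are to each other, so you'd want to hold the underlying tasks fixed and only vary the framing.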
With the insane valuations and actual revenue at stake, benchmarkers should assume they're operating in an adversarial environment. Whether through intentional gaming, training to the test, or simply prioritizing work likely to make results look better, targeting of benchmarks will almost certainly happen.
We already know large graphics card manufacturers tuned their drivers to recognize specific gaming benchmarks. Then, when that was busted, they moved to detecting benchmark-like behavior. And the money at stake in consumer gaming was tiny compared to current AI valuations. The cat-and-mouse cycle of measure vs. counter-measure won't stop, and should be a standard part of developing and administering benchmark services.
Beyond hardening against adversarial gaming, benchmarkers bear a longer-term burden too. Per Goodhart's Law, it's inevitable that good benchmarks will become targets. The challenge is that the industry will increasingly target performing well on leading benchmarks, both because it drives revenue and because it's far clearer than trying to glean from imprecise surveys and fuzzy metrics what actually helps average users. To the extent benchmarks become a proxy for reality, they'll bear the burden of continuously re-calibrating their workloads to reflect reality as users' needs evolve.
But that's removing a component that's critical for the test. We as users/benchmark consumers care that the service as provided by Anthropic/OpenAI/Google is consistent over time given the same model/prompt/context.
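A drift canary is one cheap way to watch for exactly that consistency. A sketch, assuming a hypothetical ask() helper that hits the public endpoint:

```python
# Sketch of a drift canary: re-run a fixed prompt set against the public
# endpoint on a schedule and log a digest of each response, so a silent
# model or serving change shows up as a change in the logged record.
# The ask() helper and log path are assumptions, not a real tool.
import hashlib
import json
import time

CANARY_PROMPTS = [
    "List the prime numbers below 30.",
    "Translate 'good morning' into French.",
]

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def run_canary(ask, log_path: str = "canary_log.jsonl") -> None:
    # Exact-hash comparison only makes sense with deterministic settings;
    # with nonzero temperature you'd compare normalized or embedded text.
    record = {
        "ts": time.time(),
        "responses": {p: digest(ask(p)) for p in CANARY_PROMPTS},
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```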
Yes, I reached out to them, but as you say it's a chicken-and-egg problem.
Thanks!