Comment by mohsen1
18 hours ago
Hope you don't mind the unrelated question:
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
Benchmarks can get costly to run. You can reach out to frontier model creators and ask for free credits, but usually they'll only agree once your benchmark is pretty popular.
So basically they know that requests using your API key should be treated with care?
They could, but you can also have some trust in Anthropic to have some integrity there. These are earnest people.
"Trust but verify", of course. Artificial Analysis (https://latent.space/p/artificialanalysis) uses API keys but also does mystery shopper checks.
The last thing a proper benchmark should do is reveal its own API key.
That's a good thought I hadn't had, actually.
IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.
Yes, I reached out to them, but as you say, it's a chicken-and-egg problem.
Thanks!