Comment by mohsen1

6 months ago

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark, but it is too expensive to do enough runs to get a fair comparison.

15 comments

mohsen1

Benchmarks can get costly to run. You can reach out to frontier model creators and ask them for free credits, but usually they'll only agree to that once your benchmark is pretty popular.

Dolores12 6 months ago
so basically they know requests using your API key should be treated with care?
- swyx 6 months ago
  
  they could, but you can also have some trust in Anthropic to have some integrity there, these are earnest people.
  "trust but verify" ofc. Artificial Analysis (https://latent.space/p/artificialanalysis) use api keys but also do mystery shopper checks
  
epolanski 6 months ago
The last thing a proper benchmark should do is reveal its own API key.
- plagiarist 6 months ago
  
  IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.
  
- sejje 6 months ago
  
  That's a good thought I hadn't had, actually.
mohsen1 6 months ago

yes I reached out to them but as you say it's a chicken-and-egg problem.
Thanks!