Comment by smusamashah

1 year ago

    We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.

5 comments

smusamashah

attentive 1 year ago

"We wished to also evaluate Claude 3.7 Sonnet, but were unable to complete the experiments given rate limits with the Anthropic API."

swyx 1 year ago
overall i REALLY like this paper and effort, but this part sounds like a bit of bullshit. they dont have the ability to implement retries and backoffs to deal with rate limits?
- eightysixfour 1 year ago
  
  Because they used wall clock time, not compute time, flops, or watts, to standardize. 24 hours and 36 hours of compute.
  They could build a system which gives them equal compute time by ignoring time spent rate limiting and such, but they chose not to.
  
  1 reply →
- moralestapia 1 year ago
  
  "Why don't they just break the TOS"
  Damned if you do, damned if you don't.