Comment by brookst
1 month ago
A common pattern is for benchmarks owners to hold back X% of their set so they can independently validate that models perform similarly on the holdback set. See: FrontierMath / OpenAI brouhaha.
1 month ago
A common pattern is for benchmarks owners to hold back X% of their set so they can independently validate that models perform similarly on the holdback set. See: FrontierMath / OpenAI brouhaha.
No comments yet
Contribute on Hacker News ↗