Comment by raincole
7 days ago
I don't know the details (of course, it's unreleased), but note that MathArena evaluated "average of 4 attempts" and limited token usage to 64k.
OpenAI likely had unlimited tokens, and evaluated "best of N attempts."
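To make the gap between the two scoring rules concrete, here is a rough sketch. It assumes every attempt is independent and solves a given problem with the same probability p, and that "best of N" counts a problem as solved if any attempt is correct. Neither assumption is confirmed for either evaluation, and the function names and the numbers used are purely illustrative.

```python
def expected_avg_of_k(p: float, k: int = 4) -> float:
    """Expected score on one problem when the k attempts are averaged.

    Assumption: each attempt is independent and correct with probability p.
    The mean of k such attempts has expectation p, whatever k is.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return p


def expected_best_of_n(p: float, n: int) -> float:
    """Expected score on one problem when only the best of n attempts counts.

    Assumption: the problem counts as solved if at least one of n
    independent attempts is correct, i.e. 1 - (1 - p)^n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.25  # hypothetical per-attempt solve rate, not a measured value
    print("average of 4:", expected_avg_of_k(p, 4))
    print("best of 4:   ", expected_best_of_n(p, 4))
```

With a hypothetical p of 0.25, averaging stays at 0.25 while best-of-4 comes out at about 0.68, so the same model can look very different depending on which rule is reported. This says nothing about the token limit, which is a separate effect.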