Comment by notnullorvoid
5 months ago
It's a very flawed test.
> We sourced real tasks that were previously solved by paid contributors.
It seems possible/likely the answers would be in the training data (time dependent, maybe some were answered post training, but pre benchmark).
They do address the potential for contamination in the paper fwiw:
> Note that Table 4 in Appendix A2 shows no clear performance improvement for tasks predating the models’ knowledge cutoffs, suggesting limited impact of contamination for those tasks.