Comment by notnullorvoid

5 months ago

It's a very flawed test.

> We sourced real tasks that were previously solved by paid contributors.

It seems possible/likely the answers would be in the training data (time-dependent; maybe some were answered post-training but pre-benchmark).

They do address the potential for contamination in the paper fwiw:

> Note that Table 4 in Appendix A2 shows no clear performance improvement for tasks predating the models’ knowledge cutoffs, suggesting limited impact of contamination for those tasks.