Comment by notnullorvoid

1 year ago

It's a very flawed test.

> We sourced real tasks that were previously solved by paid contributors.

It seems possible/likely the answers would be in the training data (time dependent, maybe some were answered post training, but pre benchmark).


throwaway0123_5 1 year ago

They do address the potential for contamination in the paper fwiw:

> Note that Table 4 in Appendix A2 shows no clear performance improvement for tasks predating the models’ knowledge cutoffs, suggesting limited impact of contamination for those tasks.