Comment by runako

5 months ago

It looks like they sourced tasks via a public GitHub repository, which is possibly part of the training dataset for the LLM. (It is not clear from my scan whether the actual answers are also in the public corpus.)

Does this work as an experiment if the questions under test were also used to train the LLMs?

It's a very flawed test.

> We sourced real tasks that were previously solved by paid contributors.

It seems possible, even likely, that the answers would be in the training data (it's time-dependent; maybe some were answered post-training but pre-benchmark).

  • They do address the potential for contamination in the paper fwiw:

    > Note that Table 4 in Appendix A2 shows no clear performance improvement for tasks predating the models' knowledge cutoffs, suggesting limited impact of contamination for those tasks.