Comment by brucethemoose2
2 years ago
Then it's not really a benchmark? Model trainers and researchers are not continuously testing; they dump something and then move on.
The answer is standard "secret" closed source tests, performed in a controlled environment.
I know, I don't like the sound of it either, but in this case I think closed source + a single overseeing entity is the best solution, by far. Facebook already made something like this, but they only went halfway (publishing the questions while keeping the answers secret).
Interestingly, the College Board might be the best entity to do this.
Colleges are apparently no longer using standardized tests, so why not put that effort towards AI?
It's really exactly what we need: novel questions with minimal reuse, designed to assess general intelligence across multiple dimensions, and created and curated by an independent team of experts.