
Comment by EagnaIonat

6 months ago

> What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set,

There is nothing suspicious about this, and the wording seems incorrect.

A hold-out set is a portion of the overall data that is used to test a model; the model is simply not trained on it. Model developers normally have full access to it.

There is nothing inherently wrong with training on a full or partial hold-out set. It just means you have done a different split in order to train again.

The confusion I see here is that people are equating a hold-out set with a blind set. A blind set is data to test against that the model developers (and the model) cannot see.
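To make the distinction concrete, here is a minimal sketch of a hold-out split, assuming a simple random split over a toy dataset (the function name `split_holdout` and all parameters are illustrative, not from the thread). The key point is that developers can still inspect the hold-out portion; a blind set would instead be kept by a third party and never shown to them.

```python
import random

def split_holdout(data, holdout_frac=0.2, seed=0):
    """Split data into (train, holdout).

    The hold-out portion is excluded from training, but the
    developers can still see it. A blind set, by contrast,
    would never be visible to the developers at all.
    """
    rng = random.Random(seed)        # fixed seed for reproducibility
    shuffled = data[:]               # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]

data = list(range(100))
train, holdout = split_holdout(data)
# 80 training items, 20 held-out items; together they cover all the data.
```

Re-running with a different seed is exactly the "different split to train again" described above.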

Even so, blind sets can go stale after a few runs, and there is nothing wrong with ingesting a stale blind set, as long as you have a new blind set to run against.

Trying to game blind set tests is nothing new and it gets very quickly found out.

What I took from the original article is that the blind set is likely unbalanced, and the model answered more of the easier questions than the hard ones.

> The confusion I see here is that people are equating a hold out set to a blind set. That's a set of data to test against that the model developers (and model) cannot see.

What on earth? This is from Tamay Besiroglu at Epoch:

  Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training. 

So this "confusion" is because Epoch AI specifically told people it was a blind set! Despite the condescending tone, your comment is just plain wrong.