Comment by zone411

3 days ago

Grok 4 sets a new high score on my Extended NYT Connections benchmark (92.4), beating o3-pro (87.3): https://github.com/lechmazur/nyt-connections/.

Grok 4 Heavy is not in the API.

Very impressive, but what do you think the chances are that this was in the training data?

  • > but what do you think the chances are that this was in the training data?

    Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections are in some of the other datasets.

  • You raise a good point. It seems like it would be trivial to pick out some of the puzzles and remove all the answers from the training data.

    I wish AI companies would do this.

  • The exact questions are almost certainly not in the training data, since extra words are added to each puzzle, and I don't publish these along with the original words (though there's a slight chance they used my previous API requests for training).

    To guard against potential training data contamination, I separately calculate the score using only the newest 100 puzzles. Grok 4 still leads.
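
The recency check described above could be sketched roughly as follows. This is only an illustration, not the benchmark's actual code: the `PuzzleResult` type, its fields, and the 0 to 1 per-puzzle score are all assumptions made for the example, and the real scoring in the repository may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PuzzleResult:
    """Hypothetical record for one model attempt at one puzzle."""
    puzzle_date: date  # publication date of the puzzle (assumed available)
    score: float       # assumed per-puzzle score in [0, 1]


def mean_score(results: list[PuzzleResult]) -> float:
    """Average per-puzzle score, scaled to 0-100."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.score for r in results) / len(results)


def newest_subset_score(results: list[PuzzleResult], n: int = 100) -> float:
    """Score only the n most recently published puzzles.

    Newer puzzles are less likely to appear in a model's training data,
    so a large gap between this and the overall score hints at contamination.
    """
    newest = sorted(results, key=lambda r: r.puzzle_date, reverse=True)[:n]
    return mean_score(newest)
```

Comparing `mean_score(results)` with `newest_subset_score(results)` for each model shows whether the ranking holds on puzzles published after a likely training cutoff.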