Comment by zone411
3 days ago
Grok 4 sets a new high score on my Extended NYT Connections benchmark (92.4), beating o3-pro (87.3): https://github.com/lechmazur/nyt-connections/.
Grok 4 Heavy is not in the API.
Very impressive, but what do you think the chances are that this was in the training data?
> but what do you think the chances are that this was in the training data?
Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other NYT Connections information, problems, and solutions appear in some of the other datasets.
If your definition of cheating is "it was fed the answers during training" then every LLM is surely cheating and the real question is why other LLMs didn't do as well in this benchmark.
"It also leads when considering only the newest 100 puzzles."
You raise a good point. It seems like it would be trivial to pick out some of the puzzles and remove all the answers from the training data.
I wish AI companies would do this.
The exact questions are almost certainly not in the training data, since extra words are added to each puzzle, and I don't publish these along with the original words (though there's a slight chance they used my previous API requests for training).
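A minimal sketch of that distractor-word idea (the function, variable names, and word lists here are hypothetical illustrations, not the benchmark's actual code): mixing unpublished extra words into the 16 original puzzle words means the exact prompt text never matches any published puzzle.

```python
import random

def add_distractors(puzzle_words, distractor_pool, n_extra=4, seed=None):
    """Mix extra 'distractor' words into the original puzzle words and
    shuffle, so the exact prompt never matches a published puzzle."""
    rng = random.Random(seed)
    extras = rng.sample(distractor_pool, n_extra)
    mixed = list(puzzle_words) + extras
    rng.shuffle(mixed)
    return mixed, extras

# Placeholder data: 16 original words plus a pool of unpublished extras.
words = [f"word{i}" for i in range(16)]
pool = [f"extra{i}" for i in range(10)]
mixed, extras = add_distractors(words, pool, n_extra=4, seed=0)
print(len(mixed))  # 20
```

As long as the chosen extras stay unpublished, a model that memorized the original 16-word puzzle still has to reject the distractors on its own.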
To guard against potential training data contamination, I separately calculate the score using only the newest 100 puzzles. Grok 4 still leads.
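A sketch of what that contamination guard might look like (data shape and names are assumptions, not the repository's actual code): keep a per-puzzle result with a date, then recompute the accuracy over only the n most recent puzzles.

```python
def newest_n_score(results, n=100):
    """results: list of (date_key, solved) pairs, one per puzzle.
    Returns the percent solved over the n most recent puzzles,
    which guards against older puzzles leaking into training data."""
    newest = sorted(results, key=lambda r: r[0], reverse=True)[:n]
    solved = sum(1 for _, ok in newest if ok)
    return 100.0 * solved / len(newest)

# Synthetic example: 120 puzzles, indexed by day; every 4th one missed.
results = [(day, day % 4 != 0) for day in range(120)]
print(newest_n_score(results, n=100))  # 75.0
```

If the newest-only score tracks the overall score, that's evidence the model isn't just regurgitating memorized answers.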
Grok 4 Heavy is not a separate model; from what I can tell, it just orchestrates multiple instances of grok-4.