> but what do you think the chances are that this was in the training data?
Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle and has been out for more than 2 years; even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems, and solutions from NYT Connections made it in through other datasets.
If your definition of cheating is "it was fed the answers during training", then every LLM is surely cheating, and the real question is why other LLMs didn't do as well on this benchmark.
You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than LLMs that didn't do as well in this benchmark.
We want benchmarks to be representative of performance in general (in novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.
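To put that in concrete terms, the "lookup solver" is something like this (a toy sketch; the database and column names are made up):

    import sqlite3

    # Toy illustration only: a "solver" that aces the benchmark purely by
    # retrieving answers it has already stored. No reasoning about the 16
    # words happens at all.
    conn = sqlite3.connect("connections.db")  # hypothetical answers database

    def solve(puzzle_id):
        row = conn.execute(
            "SELECT groups_json FROM answers WHERE puzzle_id = ?", (puzzle_id,)
        ).fetchone()
        return row[0] if row else None

It would score perfectly on every published puzzle and be useless on tomorrow's, which is exactly why memorization alone isn't the capability we care about.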
People have this misguided belief that LLMs just do look-ups of data present in their "model corpus", fed in during "training". That isn't even training at that point; it's just copying and compressing, like putting books into a .zip file.
This belief leads to thinking that LLMs can only give correct output if they can match it to data in their "model corpus".
> the real question is why other LLMs didn't do as well in this benchmark.
They do. There is a cycle for each major model:
- release a new model (Gemini/ChatGPT/Grok N) that beats all current benchmarks
- new benchmarks get created
- release a new model (Gemini/ChatGPT/Grok N+1) that beats the benchmarks from the previous step
"It also leads when considering only the newest 100 puzzles."
Be that as it may, that's not a zero-shot solution.
You raise a good point. It seems like it would be trivial to pick out some of the puzzles and remove all the answers from the training data.
I wish AI companies would do this.
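A rough sketch of what that filtering could look like, assuming you already have the held-out puzzles and a stream of training documents (the field names and the simple all-four-words check are just illustrative):

    def decontaminate(documents, held_out_puzzles):
        # Drop any training document that mentions all four words of an
        # answer group from a held-out puzzle. A crude heuristic; real
        # pipelines usually use n-gram overlap, but the idea is the same.
        answer_groups = [
            [w.lower() for w in group]
            for puzzle in held_out_puzzles
            for group in puzzle["answer_groups"]  # hypothetical field name
        ]
        for doc in documents:
            text = doc.lower()
            if not any(all(w in text for w in group) for group in answer_groups):
                yield doc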
The exact questions are almost certainly not in the training data, since extra words are added to each puzzle, and I don't publish these along with the original words (though there's a slight chance they used my previous API requests for training).
To guard against potential training data contamination, I separately calculate the score using only the newest 100 puzzles. Grok 4 still leads.
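Conceptually the newest-100 check is just this (simplified; the record fields here are illustrative, not the actual schema):

    def score_newest(results, n=100):
        # results: one record per puzzle, e.g. {"date": "2025-07-01", "score": 0.75}
        newest = sorted(results, key=lambda r: r["date"], reverse=True)[:n]
        return sum(r["score"] for r in newest) / len(newest)

Since those puzzles were published after the models' training cutoffs, memorization can't explain a lead on that subset.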