Comment by diggan

3 days ago

> but what do you think the chances are that this was in the training data?

Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections is in any of the other datasets.

14 comments

diggan

simondotau 3 days ago

If your definition of cheating is "it was fed the answers during training" then every LLM is surely cheating and the real question is why other LLMs didn't do as well in this benchmark.

pornel 3 days ago
You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than LLMs that didn't do as well in this benchmark.
We want benchmarks to be representative of performance in general (in novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.
- simondotau 3 days ago
  
  My question, perhaps asked in too oblique of a fashion, was why the other LLMs — surely trained on the answers to Connections puzzles too — didn't do as well on this benchmark. Did the data harvesting vacuums at Google and OpenAI really manage to exclude every reference to Connections solutions posted across the internet?
  LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.
  
  7 replies →
Workaccount2 3 days ago

People have this misguided belief that LLMs just do look-ups of data present in their "model corpus", fed in during "training". Which isn't even training at that point its just copying + compressing. Like putting books into a .zip file.
This belief leads to the thinking that LLMs can only give correct output if they can match it to data in their "model corpus".
riku_iki 3 days ago

> the real question is why other LLMs didn't do as well in this benchmark.
they do. There is a cycle for each major model:
- release new model(Gemini/ChatGPT/Grock N) which beats all current benchmarks
- some new benchmarks created
- release new model(Gemini/ChatGPT/Grock N+1) which beats benchmarks from previous step

frozenseven 3 days ago

"It also leads when considering only the newest 100 puzzles."

bigyabai 3 days ago

Be that as it may, that's not a zero-shot solution.