← Back to context

Comment by diggan

3 days ago

> but what do you think the chances are that this was in the training data?

Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections is in any of the other datasets.

If your definition of cheating is "it was fed the answers during training" then every LLM is surely cheating and the real question is why other LLMs didn't do as well in this benchmark.

  • You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than LLMs that didn't do as well in this benchmark.

    We want benchmarks to be representative of performance in general (in novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.

    • My question, perhaps asked in too oblique of a fashion, was why the other LLMs — surely trained on the answers to Connections puzzles too — didn't do as well on this benchmark. Did the data harvesting vacuums at Google and OpenAI really manage to exclude every reference to Connections solutions posted across the internet?

      LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.

      7 replies →

  • People have this misguided belief that LLMs just do look-ups of data present in their "model corpus", fed in during "training". Which isn't even training at that point its just copying + compressing. Like putting books into a .zip file.

    This belief leads to the thinking that LLMs can only give correct output if they can match it to data in their "model corpus".

  • > the real question is why other LLMs didn't do as well in this benchmark.

    they do. There is a cycle for each major model:

    - release new model(Gemini/ChatGPT/Grock N) which beats all current benchmarks

    - some new benchmarks created

    - release new model(Gemini/ChatGPT/Grock N+1) which beats benchmarks from previous step