Comment by simondotau

4 days ago

If your definition of cheating is "it was fed the answers during training" then every LLM is surely cheating and the real question is why other LLMs didn't do as well in this benchmark.

You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than LLMs that didn't do as well in this benchmark.
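
To make that concrete, here is a minimal sketch of such a "solver" (assuming a hypothetical SQLite table puzzles(puzzle_id, answer) that stores the benchmark's published answers), which would score perfectly without doing any reasoning at all:

    import sqlite3

    # Hypothetical benchmark store: every puzzle paired with its published answer.
    conn = sqlite3.connect("connections_benchmark.db")

    def solve(puzzle_id: int) -> str:
        # No reasoning at all: just retrieve the stored answer verbatim.
        row = conn.execute(
            "SELECT answer FROM puzzles WHERE puzzle_id = ?",
            (puzzle_id,),
        ).fetchone()
        return row[0]

A perfect score, and it says nothing about how the "solver" would fare on any puzzle outside that table.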

We want benchmarks to be representative of performance in general (on novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.

  • My question, perhaps asked too obliquely, was why the other LLMs — surely trained on the answers to Connections puzzles too — didn't do as well on this benchmark. Did the data harvesting vacuums at Google and OpenAI really manage to exclude every reference to Connections solutions posted across the internet?

    LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.

    • There's a difficult balance between letting the model simply memorize inputs and forcing it to figure out generalisations.

      When a model is "lossy" and can't reproduce the data by copying, it's forced to come up with rules to synthesise the answers instead, and this is usually the "intelligent" behavior we want. It should be forced to learn how multiplication works instead of storing every combination of numbers as a fact.

      Compression is related to intelligence: https://en.wikipedia.org/wiki/Kolmogorov_complexity
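
      As a toy illustration (purely illustrative, not how an LLM actually represents anything), memorization is a finite table that breaks on unseen inputs, while the learned rule generalises to inputs it was never trained on:

          # "Memorization": a lookup table built only from the training pairs.
          memorized = {(a, b): a * b for a in range(10) for b in range(10)}

          def multiply_memorized(a: int, b: int) -> int:
              return memorized[(a, b)]   # KeyError on anything unseen, e.g. (12, 7)

          # "Generalisation": the rule itself, covering inputs never seen in training.
          def multiply_rule(a: int, b: int) -> int:
              return a * b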

    • There are many basic techniques in machine learning designed specifically to avoid memorizing training data. I contend any benchmark which can be “cheated” via memorizing training data is approximately useless. I think comparing how the models perform on say, today’s Connections would be far more informative despite the sample being much smaller. (Or rather any set for which we could guarantee the model hasn’t seen the answer, which I suppose is difficult to achieve since the Connections answers are likely Google-able within hours if not minutes).
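
      One way to approximate that guarantee is to score only puzzles published after the model's training cutoff. A minimal sketch, assuming hypothetical puzzle records tagged with a publication date and an assumed cutoff for the model under test:

          from datetime import date

          # Assumed training cutoff for the model under test (hypothetical value).
          TRAINING_CUTOFF = date(2025, 6, 1)

          def held_out(puzzles: list[tuple[date, dict]]) -> list[dict]:
              # Keep only puzzles the model cannot have seen during training.
              return [p for published, p in puzzles if published > TRAINING_CUTOFF]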

People have this misguided belief that LLMs just do look-ups of data present in their "model corpus", fed in during "training". That isn't even training at that point; it's just copying and compressing, like putting books into a .zip file.

This belief leads to the thinking that LLMs can only give correct output if they can match it to data in their "model corpus".

> the real question is why other LLMs didn't do as well in this benchmark.

They do. There is a cycle for each major model:

- release a new model (Gemini/ChatGPT/Grok N) which beats all current benchmarks

- some new benchmarks are created

- release a new model (Gemini/ChatGPT/Grok N+1) which beats the benchmarks from the previous step