Comment by pornel
4 days ago
You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than the LLMs that didn't do as well on this benchmark.
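To make the lookup point concrete, here's a minimal sketch (hypothetical database, table, and column names — the point is that a pure lookup scores perfectly while learning nothing):

    import sqlite3

    # A "solver" that scores 100% by looking each answer up in the benchmark
    # data itself (hypothetical table/column names). Perfect accuracy here,
    # zero capability on anything outside the dataset.
    def solve(puzzle_id: str, db_path: str = "connections_benchmark.db") -> str:
        conn = sqlite3.connect(db_path)
        row = conn.execute(
            "SELECT answer FROM puzzles WHERE puzzle_id = ?", (puzzle_id,)
        ).fetchone()
        conn.close()
        return row[0]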
We want benchmarks to be representative of performance in general (on novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.
My question, perhaps asked in too oblique of a fashion, was why the other LLMs — surely trained on the answers to Connections puzzles too — didn't do as well on this benchmark. Did the data harvesting vacuums at Google and OpenAI really manage to exclude every reference to Connections solutions posted across the internet?
LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.
There's a difficult balance between letting the model simply memorize inputs and forcing it to figure out generalisations.
When a model is "lossy" and can't reproduce the data by copying, it's forced to come up with rules to synthesise the answers instead, and this is usually the "intelligent" behavior we want. It should be forced to learn how multiplication works instead of storing every combination of numbers as a fact.
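The multiplication example in miniature (a toy sketch, not anything a real model does): memorisation stores a separate fact for every pair of numbers, while the generalised rule is tiny and also works on inputs it has never seen.

    # Memorisation: every product up to N stored as a separate fact.
    N = 1000
    fact_table = {(a, b): a * b for a in range(N) for b in range(N)}  # ~10^6 entries

    # Generalisation: one small rule that also covers unseen inputs.
    def multiply(a: int, b: int) -> int:
        return a * b

    assert fact_table[(12, 34)] == multiply(12, 34)
    assert multiply(1234, 5678) == 7006652  # far outside the stored table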
Compression is related to intelligence: https://en.wikipedia.org/wiki/Kolmogorov_complexity
You're not answering the question. Grok 4 also performs better on the semi-private evaluation sets for ARC-AGI-1 and ARC-AGI-2. It's across-the-board better.
There are many basic techniques in machine learning designed specifically to avoid memorizing training data. I contend any benchmark which can be “cheated” via memorizing training data is approximately useless. I think comparing how the models perform on, say, today’s Connections would be far more informative despite the sample being much smaller. (Or rather any set for which we could guarantee the model hasn’t seen the answer, which I suppose is difficult to achieve since the Connections answers are likely Google-able within hours, if not minutes.)
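One way to approximate "the model hasn't seen the answer" is to evaluate only on puzzles published after the model's training cutoff. A rough sketch, assuming hypothetical puzzle records and a known cutoff date:

    from datetime import date

    # Hypothetical records; in practice these would come from the benchmark set.
    puzzles = [
        {"id": "nyt-2024-06-01", "published": date(2024, 6, 1), "answer": "..."},
        {"id": "nyt-2025-07-10", "published": date(2025, 7, 10), "answer": "..."},
    ]

    TRAINING_CUTOFF = date(2024, 12, 1)  # assumed cutoff for the model under test

    # Keep only puzzles the model cannot have seen during training.
    held_out = [p for p in puzzles if p["published"] > TRAINING_CUTOFF]

Even this only buys a narrow window, since solutions get posted online shortly after each puzzle appears.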