Comment by MadxX79

9 hours ago

Same question I have for all these benchmarks:

What's going to stop e.g. OpenAI from hiring a bunch of teenagers to play these games non-stop for a month, annotating each game with their logic for deriving the rules, generating a dataset from those playthroughs, and fine-tuning the next version of ChatGPT on it?

They would score much worse on the private set than the public set. And they haven't done this for any of the other ARC-AGI benchmarks, so why would they do it for this one?

Wrong question. I suggest:

1) Do models generalize?

2) If they do, and they generalize from this, is that a win?

Chollet was one of the first “they do not generalize” evangelists. I’d be curious to hear what he thinks now, because a) most disagree with him, and b) this test seems designed to push models that can generalize toward better visual long-context problem solving and agency, which is exactly where the bleeding edge is right now for agentic systems.

  • Yeah, so you are agreeing that the benchmarks are useless because they don't answer those questions.

  • Can AI models generalize+ at long-context problem solving and agency regardless of modality? I think the answer is no, and this is why they are not yet AGI.

    + generalize being the key word.