Comment by frozenseven

7 months ago

You're not answering the question. Grok 4 also performs better on the semi-private evaluation sets for ARC-AGI-1 and ARC-AGI-2. It's across-the-board better.

4 comments


emp17344 7 months ago

If these things are truly exhibiting general reasoning, why do the same models do significantly worse on ARC-AGI-2, which is practically identical to ARC-AGI-1?

frozenseven 7 months ago
It's not identical. ARC-AGI-2 is more difficult - both for AI and humans. In ARC-AGI-1 you kept track of one (or maybe two) kinds of transformations or patterns. In ARC-AGI-2 you are dealing with at least three, and the transformations interact with one another in more complex ways.
Reasoning isn't an on-off switch. It's a hill that needs climbing. The models are getting better at complex and novel tasks.
- emp17344 7 months ago
  
  This simply isn’t the case. Humans actually perform better on ARC-AGI-2, according to their website: https://arcprize.org/leaderboard
  