
Comment by z7

3 days ago

How do you explain Grok 4 achieving a new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555

They could still have trained the model in a way that focuses on benchmarks, e.g. by training on more examples of ARC-style questions.

What I've noticed when testing previous versions of Grok is that, on paper, they were better at benchmarks, but when I actually used them the responses were always worse than Sonnet's and Gemini's.

Occasionally I test Grok to see if it could become my daily driver, but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.

  • They could still have trained the model in a way that focuses on benchmarks, e.g. by training on more examples of ARC-style questions

    That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.

    • It still seems possible to put effort into building an ARC-style dataset, and that would game the test. The ARC questions I saw weren't on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area, though, so I'd be curious if I'm wrong.


  • I use Grok with repomix to review my code, and it tends to give decent answers. It's a bit better at giving actual, actionable issues with code examples than, say, Gemini 2.5 Pro.

    But the lack of a CLI tool like codex, claude code or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.

    With gemini I can just go `gemini -p "@repomix-output.xml review this code..."`
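
    For anyone who wants to reproduce that flow, here's a minimal sketch (it assumes repomix is installed or run via npx, that `repomix-output.xml` is its default output filename, and the prompt text is just an example):

    ```sh
    # Pack the whole repo into a single file (repomix writes repomix-output.xml by default)
    npx repomix

    # Hand the packed repo to Gemini for review using gemini-cli's @file syntax
    gemini -p "@repomix-output.xml review this code and list actionable issues"
    ```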

As I said: either by benchmark contamination (the set is semi-private and could have been obtained by people at other companies whose models have been benchmarked) or by having more compute.

I still don't understand why people point to this chart as if it means anything. Cost per task is a fairly arbitrary x-axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.