Comment by modeless
4 days ago
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.
Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.
Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from Polymarket, reasoned for a long time, and then came back with the Polymarket number for the Dodgers but presented it as its own. It was a really fast run-through, so I may be wrong, but it reminds me that it's useful to have skeptics on the safety teams of these frontier models.
That said, these are HUGE improvements. Provided there's no benchmark contamination, this should be a very popular daily driver.
On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.
Either they overtook other LLMs by simply using more compute (which is plausible, since they have a lot of GPUs), or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with any better techniques than those used to train other LLMs, and Elon has a history of making deceptive announcements.
How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?
https://x.com/arcprize/status/1943168950763950555
Anecdotally, output in my tests is pretty good. It's at least competitive with SOTA from other providers right now.
I wish the coding models were available in coding agents. Haven't seen them anywhere.
Grok 4 is now available in Cursor.
I just tried it; it was very slow, like Gemini.
But I really liked the few responses it gave me, highly technical language. Not the flowery stuff you find in ChatGPT or Gemini, but much more verbose and thorough than Claude.
Interesting, I have the latest update and I don't see it in the models list.
Plenty like Aider and Cline can connect to pretty much any model with an API.
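Right - most of these agents just speak the OpenAI-compatible chat API, so anything that exposes one works. Here's a minimal sketch of what that looks like under the hood, assuming xAI's endpoint is OpenAI-compatible and the model id is "grok-4" (both are assumptions; check their docs for the real values):

    # pip install openai
    from openai import OpenAI

    # Point the standard OpenAI client at a different provider's endpoint.
    # The base_url and model name below are assumptions, not confirmed values.
    client = OpenAI(
        base_url="https://api.x.ai/v1",  # hypothetical xAI endpoint
        api_key="YOUR_XAI_API_KEY",
    )

    resp = client.chat.completions.create(
        model="grok-4",  # hypothetical model id
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(resp.choices[0].message.content)

Tools like Aider and Cline do essentially this, with the endpoint and key supplied through their own config or environment variables.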
Even if one does not have a positive view of Elon Musk, Grok catching up to the big three (Google, OpenAI, Anthropic) is incredible. They are now at approximately the same level.
Well, we have GPT-5 and Gemini 3 in the wings, so it wouldn't be surprising if it's only SOTA for a few days.
Yup, this will probably trigger the next wave of releases; someone had to go first.
> Seems like it is indeed the new SOTA model, with significantly better scores than o3
It has been demonstrated for quite some time that censoring models results in drastically reduced scores. Sure, maybe prevent it from telling someone how to build a bomb, but we've seen Grok 3 routinely side with progressive views despite having access to the worst of humanity (and its sponsor).
Wait, are you implying that Grok 3 is "censored" because it aligns with "progressive" views?
I think they're implying that Grok is smarter because it's less censored, and then separately noting that it still tends to be fairly progressive despite the lack of censorship (when it's not larping as Hitler) even though it was presumably trained on the worst humanity has to offer.
Man, that sentence would have been incomprehensible just a couple years ago.
As has been the case with almost all US social media companies until the last year: they were all heavily biased toward left-leaning views and censored content accordingly.