
Comment by porphyra

4 days ago

Honestly, if it actually does score 44.4% on Humanity's Last Exam, that would be super impressive, as Gemini 2.5 Pro and o3 with tools only score 26.9% and 24.9% respectively.

Is that not just how scaling goes? It generally feels like the top models are mostly interchangeable and the one that came out at time t+1 will be better than earlier models from time t.

Grok 4 was probably already in training when o3 was released, and now that Grok 4 is out, OpenAI is probably preparing o4, Google is preparing Gemini 3, and soon new SOTA benchmark scores will appear.

So it is impressive but not surprising, no? Whoever releases the latest model and has sufficient compute will be SOTA.

I dunno, "with tools" means different things for different models. It depends on what tools you give it access to. HLE demands a lot of specialized stuff. Like an interpreter for the esoteric programming language Piet for two questions. If you're not standardizing the set of tools, these aren't apples-to-apples numbers.

  • Even without tools it also outperforms Gemini 2.5 Pro and o3: 25.4% compared to 21.6% and 21.0%. Although I wonder if any of the exam leaked into the training set, or if it was specifically trained to be good at benchmarks, Llama 4 style.

I would like to see FrontierMath results. Don't have a lot of personal trust in HLE.

  • "Don't have a lot of personal trust in HLE."

    Why?

    • A lot of the questions are simple subject-matter knowledge, and some of them are multiple-choice. Asking LLMs multiple-choice questions is scientific malpractice: it is not interesting that statistical next-token predictors can attain superhuman performance on multiple-choice tests. We've all known since childhood that you can go pretty far on a Scantron by using surface heuristics and a vague familiarity with the material.

      I will add that, as an unfair smell test, the very name "Humanity's Last Exam" implies an arrogant contempt for scientific reasoning, and I would not be at all surprised if there were corruption similar to the FrontierMath and OpenAI situation: maybe xAI funded HLE in exchange for peeking at the questions.


    • I only know math, and of the two example math questions, I think one of them is wrong. So based on this very limited data, I don't really trust their problems. Granted, I'm not completely sure about my claim.