Comment by porphyra
4 days ago
Honestly, if it actually does score 44.4% on Humanity's Last Exam, that would be super impressive, as Gemini 2.5 Pro and o3 with tools only score 26.9% and 24.9%.
Is that not just how scaling goes? It generally feels like the top models are mostly interchangeable and the one that came out at time t+1 will be better than earlier models from time t.
Grok 4 was probably already training when o3 was released, and now that Grok 4 is out, OpenAI is probably preparing o4 and Google is preparing Gemini 3, so new SOTA benchmark scores will appear soon.
So it is impressive but not surprising, no? Whoever releases the latest model and has sufficient compute will be SOTA.
Meta had enough compute, I think. No SOTA though.
I dunno, "with tools" means different things for different models; it depends on what tools you give it access to. HLE demands a lot of specialized stuff, like an interpreter for the esoteric programming language Piet for two of the questions. If you're not standardizing the set of tools, these aren't apples-to-apples numbers.
Even without tools it outperforms Gemini 2.5 Pro and o3: 25.4% compared to 21.6% and 21.0%. Although I wonder whether any of the exam leaked into the training set, or whether it was specifically trained to be good at benchmarks, Llama 4 style.
I would like to see FrontierMath results. Don't have a lot of personal trust in HLE.
"Don't have a lot of personal trust in HLE."
Why?
A lot of the questions are simple subject-matter knowledge, and some of them are multiple-choice. Asking LLMs multiple-choice questions is scientific malpractice: it is not interesting that statistical next-token predictors can attain superhuman performance on multiple-choice tests. We've all known since childhood that you can go pretty far on a Scantron by using surface heuristics and a vague familiarity with the material.
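To make that concrete, here is a rough sketch of how multiple-choice items are commonly scored for LLMs: just pick the option the model assigns the highest likelihood. This is my own illustration, not HLE's actual harness; the model (gpt2 as a stand-in) and the prompt format are assumptions.

```python
# Sketch: score each multiple-choice option by its total token log-probability
# under the model, then pick the argmax. The model never has to produce or
# defend an answer; it only has to rank four strings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model, an assumption
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probs of `option` conditioned on `question`."""
    prompt_ids = tok(question, return_tensors="pt").input_ids
    option_ids = tok(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i of log_probs predicts token i+1 of input_ids.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    start = prompt_ids.shape[1] - 1  # first position predicting an option token
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

question = "Q: Which planet is largest?\nA:"
options = ["Mercury", "Jupiter", "Mars", "Venus"]
pred = max(options, key=lambda o: option_logprob(question, o))
print(pred)
```

Under this scheme, surface statistics and vague familiarity really can carry a model a long way, which is exactly the Scantron effect.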
I will add that, as an unfair smell test, the very name "Humanity's Last Exam" implies an arrogant contempt for scientific reasoning, and I would not be at all surprised if it were corrupt in a way similar to FrontierMath and OpenAI: maybe xAI funded HLE in exchange for peeking at the questions.
I only know math, and of the two example math questions, I think one of them is wrong. So from the very limited data I have, I don't really trust their problems. OK, I'm not completely sure about my claim.