Comment by UltraSane

3 days ago

"Don't have a lot of personal trust in HLE."

Why?

A lot of the questions are simple subject-matter knowledge, and some of them are multiple-choice. Asking LLMs multiple-choice questions is scientific malpractice: it is not interesting that statistical next-token predictors can attain superhuman performance on multiple-choice tests. We've all known since childhood that you can go pretty far on a Scantron with surface heuristics and a vague familiarity with the material.
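The guessing floor behind that point is easy to quantify. A minimal sketch (purely illustrative; the question counts and choice counts here are made up, not HLE's actual format):

```python
import random

def random_guess_accuracy(n_questions=10_000, n_choices=4, seed=0):
    """Simulate blind guessing on an n_choices multiple-choice exam."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        answer = rng.randrange(n_choices)  # the keyed answer
        guess = rng.randrange(n_choices)   # a guess with no knowledge at all
        correct += (guess == answer)
    return correct / n_questions

# The floor is 1/n_choices (~0.25 here). Any surface heuristic that
# correlates even weakly with the keyed answer lifts scores above it,
# which is why multiple-choice accuracy says little about reasoning.
print(random_guess_accuracy())
```

Free-response formats have no such floor, which is part of the objection above.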

I will add that, as an unfair smell test, the very name "Humanity's Last Exam" implies an arrogant contempt for scientific reasoning, and I would not be at all surprised if it were corrupted in a similar way to FrontierMath and OpenAI - maybe xAI funded HLE in exchange for peeking at the questions.

  • "A lot of the questions are simple subject matter knowledge" Aren't most questions incredibly hard?

    • "Simple" is unfair to the humans who discovered that knowledge, but not to the LLM. The point is that such questions are indistinguishable from niche trivia - they aren't actually "hard" in a cognitive sense, merely esoteric, a matter of surface-feature identification + NLP. I don't know anything about hummingbird anatomy, but that's because I'm not interested in hummingbirds and haven't read papers about them. Does it make sense to call such questions "hard"? Hard in the sense of a trivia game, or of actual cognitive ability? And it's frustrating to see these lumped in with computational questions, analysis questions, etc. What exactly is HLE benchmarking? It is not a scientifically defensible measurement. It seems like the express purpose of the test is

      a) to make observers say "wow those questions sure are hard!" without thinking carefully about what that means for an LLM versus a human

      b) to let AI folks sneer that the LLM might be smarter than you because it can recite facts about category theory and you can't

      (Are my cats smarter than you because they know my daily habits and you don't? The conflation of academically/economically useful knowledge with "intelligence" is one of AI's dumbest and longest-standing blunders.)

    • Some of the questions are based on research papers, but an LLM that can search the internet may be able to essentially look up the answer instead of reasoning it through on its own.

I only know math, and of the two example math questions, I think one of them is wrong. So from the very limited data I have, I don't really trust their problems. (Though I'm not completely sure about my claim.)