Comment by operatingthetan

12 hours ago

>hopefully changes the way benchmarking is done.

Yeah, the path forward is simple: check whether the submitted solutions actually contain solutions. If they contain exploits, then that entire result is discarded.

11 comments

nananana9 4 minutes ago

But that requires me to do things :(

siva7 12 hours ago

Could it really be that we not only vibeslop every app nowadays, but also don't even bother to check how the AI solved a benchmark it claims to have solved?

stingraycharles 3 hours ago

This is already well known: all these AI benchmarks use a different model to judge whether or not the solution was correct.

It’s… remarkably poor and, as demonstrated in the paper, easily gamed. Worse yet, these benchmarks teach AIs to be very short-sighted and hyper-focused on completing the task, rather than figuring out the best solution.

retinaros 9 hours ago

Every AI lab trains on the test set. That is a big part of why we see benchmark scores climb from 1% to 30% after a few model iterations.

latentsea 4 hours ago

Models themselves definitely aren't getting better.

SpicyLemonZest 11 hours ago

Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether the model genuinely didn't memorize, or whether your memorization check was simply wrong?

operatingthetan 12 hours ago

Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.

ZeroGravitas 12 hours ago

In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.
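
A minimal sketch of that kind of negative marking, assuming each attempt has already been labeled as correct, incorrect, or exploit by some separate check (not shown here). The function name and the penalty of 3 are hypothetical, picked only for illustration:

```python
def negatively_marked_score(outcomes, exploit_penalty=3.0):
    """Score a run where each exploit cancels out several correct solutions.

    outcomes: iterable of labels, each "correct", "incorrect", or "exploit".
    exploit_penalty: how many correct solutions one exploit wipes out
                     (hypothetical value, tune to taste).
    """
    score = 0.0
    for outcome in outcomes:
        if outcome == "correct":
            score += 1.0
        elif outcome == "exploit":
            score -= exploit_penalty
        elif outcome != "incorrect":
            raise ValueError(f"unknown outcome label: {outcome!r}")
        # "incorrect" neither adds nor subtracts in this sketch
    return score


# Four genuine solutions and one exploit: the exploit cancels three of them.
run = ["correct", "correct", "correct", "correct", "exploit"]
print(negatively_marked_score(run))
```

The hard part is still the labeling step; the arithmetic only matters once something has reliably flagged the exploit.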

lambda 11 hours ago

Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" and giving a made-up answer, and made-up answers can improve the score some of the time. So by chasing higher numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.

The Artificial Analysis Omniscience benchmark does penalize guessing, so it actually helps you determine which LLMs are likely to just guess rather than tell you they don't know. Only a very few of the frontier models score higher than 0 on it, where 0 means the model is as likely to return a hallucination as a correct answer on factual questions.
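
To make the "0 means equally likely" point concrete, here is a sketch of one scoring rule that behaves this way. It is an assumption for illustration, not necessarily the benchmark's exact formula, and the function name is made up: correct answers add, wrong answers subtract, and "I don't know" is neutral.

```python
def guess_penalized_index(correct, incorrect, abstained):
    """Return a score in [-100, 100] that penalizes wrong answers.

    correct / incorrect / abstained are counts of questions.
    Abstaining ("I don't know") costs nothing; a wrong answer
    cancels out one correct answer.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("need at least one question")
    return 100.0 * (correct - incorrect) / total


# A model that always guesses and is right half the time scores 0.
print(guess_penalized_index(correct=50, incorrect=50, abstained=0))

# A model that gets fewer right but abstains when unsure scores higher.
print(guess_penalized_index(correct=40, incorrect=10, abstained=50))
```

Under plain accuracy the first model wins (50% vs 40%); under this rule the second one does, which is the behavior you actually want to reward.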

Leynos 12 hours ago

Also, fuzz your benchmarks

Aperocky 8 hours ago

solution is simple:

if bug { dont }