Comment by siva7

10 hours ago

Could it really be that we not only vibeslop all our apps nowadays, but also don't even bother to check how the AI solved a benchmark it claimed to have solved?

5 comments

stingraycharles 1 hour ago

This is already well known: all these AI benchmarks use a different model to judge whether or not the solution was correct.

It’s… remarkably poor, and as demonstrated in the paper, easily gamed. Worse yet, these benchmarks teach AIs to be very short-sighted and hyper-focused on completing the task, rather than figuring out the best solution.

retinaros 8 hours ago

Every AI lab trains on the test set. That is a big part of why we see benchmark scores climbing from 1% to 30% after a few model iterations.

latentsea 2 hours ago

Models themselves definitely aren't getting better.

SpicyLemonZest 10 hours ago

Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether the model actually didn't memorize, or whether your memorization check was simply wrong?

operatingthetan 10 hours ago

Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.