Comment by XCSme

8 days ago

But why only a +0.5% increase for MMMU-Pro?

8 comments

XCSme

It's possibly label noise. But you can't tell from a single number.

You would need to check whether everyone is making mistakes on the same 20% or a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
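
A rough sketch of that check, assuming you have per-question results for each model. The model names and question IDs below are made up for illustration, and `error_overlap` is a hypothetical helper, not part of any benchmark tooling:

```python
from itertools import combinations


def error_overlap(errors_by_model):
    """Compare the sets of question IDs that each model got wrong.

    Returns (mean pairwise Jaccard overlap, questions missed by every model).
    """
    sets = [set(ids) for ids in errors_by_model.values()]
    # Questions that every single model misses: candidates for bad keys
    # or underspecified prompts (or just very hard questions).
    shared = set.intersection(*sets) if sets else set()
    # Jaccard overlap for each pair of models: |A & B| / |A | B|.
    jaccards = [
        len(a & b) / len(a | b)
        for a, b in combinations(sets, 2)
        if a | b
    ]
    mean_jaccard = sum(jaccards) / len(jaccards) if jaccards else 0.0
    return mean_jaccard, shared


# Hypothetical data: IDs of the questions each model answered incorrectly.
errors = {
    "model_a": {1, 4, 7, 9},
    "model_b": {1, 4, 7, 8},
    "model_c": {1, 4, 6, 9},
}

mean_jaccard, shared = error_overlap(errors)
print(f"mean pairwise overlap: {mean_jaccard:.2f}")
print(f"missed by every model: {sorted(shared)}")
```

High overlap points at the questions themselves (hard, mis-keyed, or underspecified). Low overlap points at the models. It still can't tell you which of those three it is, so someone has to read the shared misses by hand.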

It happens. The old, non-Pro MMLU had a lot of wrong answers. Even simple datasets like MNIST have digits that are labeled incorrectly or drawn so badly they're not even digits anymore.

kenjackson 8 days ago

Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.

XCSme 8 days ago
But 80% sounds far from good enough. That's a 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should score 100% on any benchmark we give it.
- Davidzheng 8 days ago
  
  I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)
- kenjackson 8 days ago
  
  Are humans 100%?
  