
Comment by chvid

6 days ago

In a few months (weeks, days - maybe it has already happened) models will have much better performance on this test.

Not because of actual increased “intelligence”, but because the test will be included in the models’ training data, either directly or indirectly, as model developers “tune” their models to perform better on this particular attention-driving test.

From the post: "Evaluation began immediately after the 2025 IMO problems were released to prevent contamination."

Does this address your concern?

  • What they mean is that in a couple of weeks there are going to be stories titled "LLMS NOW BETTER THAN HUMANS AT 2025 INTERNATIONAL MATH OLYMPIAD" (stories published as thinly-veiled investment solicitations), but in reality they're still shitty -- they've just had the answers fed in to be spit back out.

Luckily there’s a new set of problems every year

  • You can really only do a fair, reproducible test if the models are static and not sitting behind an API where you have no idea how they are updated or continuously tweaked.

    • This particular test is heralded as some sort of breakthrough and the companies in this field are raising billions of dollars from investors and paying their star employees tens of millions.

      The economic incentives to tweak, tune, or cheat are through the roof.