
Comment by chvid

6 days ago

In a few months (weeks, days - maybe it has already happened) models will have much better performance on this test.

Not because of actual increased “intelligence”, but because the test will be included in the models’ training data, either directly or indirectly, as model developers “tune” their models to perform better on this particular attention-driving test.

From the post: "Evaluation began immediately after the 2025 IMO problems were released to prevent contamination."

Does this address your concern?

  • What they mean is that in a couple of weeks there are going to be stories titled "LLMS NOW BETTER THAN HUMANS AT 2025 INTERNATIONAL MATH OLYMPIAD" (stories published as thinly-veiled investment solicitations), but in reality they're still shitty -- they've just had the answers fed in to be spit back out.

Luckily there’s a new set of problems every year

  • You can really only do a fair, reproducible test if the models are static and not sitting behind an API where you have no idea how they are updated or continuously tweaked.

    • This particular test is heralded as some sort of breakthrough and the companies in this field are raising billions of dollars from investors and paying their star employees tens of millions.

      The economic incentives to tweak, tune, or cheat are through the roof.