Comment by deanCommie
16 hours ago
From Terence Tao, via Mastodon [0]:
> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance one gives the tool, and how one reports their results.
> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.
> The IMO is widely regarded as a highly selective measure of mathematical achievement; it is a significant accomplishment for a high school student to score well enough to receive a medal, particularly a gold medal or a perfect score. This year the threshold for gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".
> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:
> * One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)
> * Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.
> * The team leader gives the students unlimited access to calculators, computer algebra packages, formal proof assistants, textbooks, or the ability to search the internet.
> * The team leader has the six-student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.
> * The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.
> * Each of the six students on the team submits solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.
> * If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.
> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.
> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.
> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. EDIT: In particular, the above comments are not specific to any single result of this nature.
Unlike OpenAI, DeepMind at least signed up for the competition ahead of time.
I agree with Tao, though: I am skeptical of any result of this type unless there's a lot of transparency, ideally ahead of time. If not ahead of time, then at least the entire prompt and fine-tuning data that was used.
OAI's story here: https://x.com/polynoamial/status/1947398532899738064
Apparently IMO emailed them, but then they completed the IMO eval independently.
This is a fair reply, but TBH I don't think it's going to change much. The upper echelon of human society has decided to move AI forward rapidly regardless of any consequences. The rest of us can only hold on and pray.
You are watching American money hard at work, my friend. It's either glorious or reckless, hard to tell for now.
Could be both, but each for a different group of people.
Discussed here:
A human metaphor for evaluating AI capability - https://news.ycombinator.com/item?id=44622973 - July 2025 (30 comments)
Some of the critique is valid but some of it sounds like, "but the rules of the contest are that participants must use less than x joules of energy obtained from cellular respiration and have a singular consciousness"
I don't think anybody thinks the AI was competing fairly and within the rules that apply to humans. But even if humans were competing on the terms on which the AI solved those problems, with near-unlimited access to energy, raw compute, and data, still very few could solve them within a reasonable timeframe. It would take me probably months or years to educate myself sufficiently to even have a chance.
I don't think that characterization is fair at all. It's certainly true that you, me, and most humans can't solve these problems with any amount of time or energy. But the problems are specifically written to be at the limit of what the actual high school students who participate can solve in four hours. Letting the actual students taking the test have four days instead of four hours would make a massive difference in their ability to solve them.
Said differently, the students, the difficulty of the problems, and the time limit are specifically coordinated together, so the amount of energy spent producing a solution is not arbitrary. In the grand scheme of how the tech will improve over time, that probably won't matter and the computers will win by any metric soon enough, but Tao is completely correct to point out that you haven't accurately told us what the machines can do today, in July 2025, unless you told us ahead of time exactly which rules you were modifying.