
Comment by lufenialif2

10 hours ago

Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.

Something that was hotly debated in the thread with OpenAI's results:

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."

It seems that the answer to whether a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.

Ok, but when this gets reported by mass media, which never uses SI units and instead reaches for units like Libraries of Congress or elephants, what kind of unit should the media use to compare the computational energy of AI versus children?

  • Dollars of compute at market rate is what I'd like to see, to check whether calling this tool would cost $100 or $100,000

  • 4.5 hours × 2 "days", at 100 watts including the support system.

    I'm not sure how to implement the "no calculator" rule :) but for this kind of problem it's not critical.

    Total = 900 Wh = 3.24 MJ

    • 100 watts seems very low. A single Nvidia GeForce RTX 5090 is rated at ~600 watts, and they are probably using many GPUs/TPUs in parallel; a rough comparison under some assumed numbers is sketched after this list.


  • If the models that got a gold medal are anything like those used on ARC-AGI, then you can bet they wrote an insane amount of text trying to reason their way through these problems. Like, several bookshelves' worth of writing.

    So, funnily enough, "the AI wrote x times the Library of Congress to get there" is a good enough comparison.
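
A minimal back-of-the-envelope sketch of the comparison above, assuming a purely hypothetical cluster of 64 accelerators at ~600 W each running for the full 9 contest hours, and a ballpark ~$2 per GPU-hour market rate; none of these cluster numbers have been published by Google or OpenAI, they are illustrative guesses only:

```python
# Back-of-the-envelope energy/cost comparison: human contestant vs. a
# hypothetical GPU cluster. Cluster size, wattage, and price are assumptions.

HOURS_PER_DAY = 4.5   # IMO exam length per day
DAYS = 2

# Human contestant: ~100 W for the whole "support system" (body + brain)
human_watts = 100
human_wh = human_watts * HOURS_PER_DAY * DAYS        # 900 Wh
human_mj = human_wh * 3600 / 1e6                     # 3.24 MJ

# Hypothetical cluster: 64 accelerators at ~600 W (RTX 5090-class board power)
gpus = 64
gpu_watts = 600
cluster_wh = gpus * gpu_watts * HOURS_PER_DAY * DAYS
cluster_mj = cluster_wh * 3600 / 1e6

# Rough market-rate cost, assuming ~$2 per GPU-hour (cloud list-price ballpark)
usd_per_gpu_hour = 2.0
cluster_cost = gpus * HOURS_PER_DAY * DAYS * usd_per_gpu_hour

print(f"Human:   {human_wh:.0f} Wh ({human_mj:.2f} MJ)")
print(f"Cluster: {cluster_wh:,.0f} Wh ({cluster_mj:.0f} MJ), ~${cluster_cost:,.0f} at market rate")
print(f"Energy ratio (cluster / human): {cluster_wh / human_wh:,.0f}x")
```

Swapping in different assumptions changes the ratio and the dollar figure by orders of magnitude, which is exactly why a published compute breakdown would be useful.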

>it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

I'm not sure that's exactly what that means. It's already likely the case that these models contained IMO problems and solutions from pretraining. It's possible this means they were present in the system prompt or something similar.

  • Does the IMO reuse problems? My understanding is that new problems are submitted each year and 6 are selected for each competition. The submitted problems are then published after the IMO has concluded. How would the training data contain unpublished, newly submitted problems?

    Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.

    • The IMO doesn't reuse problems, but Terence Tao has a Mastodon post where he explains that the first five (of six) problems are generally ones where existing techniques can be leveraged to get to the answer, while the sixth problem requires considerable originality. Notably, neither Gemini nor OpenAI's model got the sixth problem. Still quite an achievement, though.


    • >How would the training data contain unpublished, newly submitted problems?

      I don't think I or the OP suggested it did.

  • Or that they did significant retraining to boost IMO performance, creating a more specialized model at the cost of general-purpose performance.