
Comment by lufenialif2

10 hours ago

Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.

Something that was hotly debated in the thread with OpenAI's results:

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."

It seems that the answer to whether a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.

Ok, but when this gets reported by mass media, which never uses SI units and instead reaches for units like Libraries of Congress or elephants, what kind of unit should the media use to compare the computational energy of AI versus children?

  • Dollars of compute at market rate is what I'd like to see, to check whether calling this tool would cost $100 or $100,000

  • 4.5 hours × 2 "days", at 100 watts including the support system.

    I'm not sure how to implement the "no calculator" rule :) but for this kind of problem it's not critical.

    Total = 900 Wh = 3.24 MJ

    • 100 watts seems very low. A single Nvidia GeForce RTX 5090 is rated at ~600 watts, and they are probably using many GPUs/TPUs in parallel; a rough comparison under some assumed numbers is sketched after this list.


  • If the models that got a gold medal are anything like those used on ARC-AGI, then you can bet they wrote an insane amount of text trying to reason their way through these problems. Like, several bookshelves' worth of writing.

    So, funnily enough, "the AI wrote x times the Library of Congress to get there" is a good enough comparison.
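
A minimal back-of-the-envelope sketch of the comparison above, assuming a purely hypothetical cluster of 64 accelerators at ~600 W each running for the full 9 contest hours, and a ballpark ~$2 per GPU-hour market rate; none of these cluster numbers have been published by Google or OpenAI, they are illustrative guesses only:

```python
# Back-of-the-envelope energy/cost comparison: human contestant vs. a
# hypothetical GPU cluster. Cluster size, wattage, and price are assumptions.

HOURS_PER_DAY = 4.5   # IMO exam length per day
DAYS = 2

# Human contestant: ~100 W for the whole "support system" (body + brain)
human_watts = 100
human_wh = human_watts * HOURS_PER_DAY * DAYS        # 900 Wh
human_mj = human_wh * 3600 / 1e6                     # 3.24 MJ

# Hypothetical cluster: 64 accelerators at ~600 W (RTX 5090-class board power)
gpus = 64
gpu_watts = 600
cluster_wh = gpus * gpu_watts * HOURS_PER_DAY * DAYS
cluster_mj = cluster_wh * 3600 / 1e6

# Rough market-rate cost, assuming ~$2 per GPU-hour (cloud list-price ballpark)
usd_per_gpu_hour = 2.0
cluster_cost = gpus * HOURS_PER_DAY * DAYS * usd_per_gpu_hour

print(f"Human:   {human_wh:.0f} Wh ({human_mj:.2f} MJ)")
print(f"Cluster: {cluster_wh:,.0f} Wh ({cluster_mj:.0f} MJ), ~${cluster_cost:,.0f} at market rate")
print(f"Energy ratio (cluster / human): {cluster_wh / human_wh:,.0f}x")
```

Swapping in different assumptions changes the ratio and the dollar figure by orders of magnitude, which is exactly why a published compute breakdown would be useful.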

>it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

I'm not sure that's exactly what that means. It's already likely the case that these models contained IMO problems and solutions from pretraining. It's possible this means they were present in the system prompt or something similar.

  • Does the IMO reuse problems? My understanding is that new problems are submitted each year and 6 are selected for each competition. The submitted problems are then published after the IMO has concluded. How would the training data contain unpublished, newly submitted problems?

    Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.

    • The IMO doesn't reuse problems, but Terence Tao has a Mastodon post where he explains that the first five (of six) problems are generally ones where existing techniques can be leveraged to get to the answer, while the sixth problem requires considerable originality. Notably, neither Gemini nor OpenAI's model got the sixth problem. Still quite an achievement, though.


    • >How would the training data contain unpublished, newly submitted problems?

      I don't think I or the OP suggested it did.

  • Or that they did significant retraining to boost IMO performance, creating a more specialized model at the cost of general-purpose performance.