OpenAI claims gold-medal performance at IMO 2025

3 days ago (twitter.com)

From Noam Brown

https://x.com/polynoamial/status/1946478258968531288

"When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is."

and

"This was a small team effort led by @alexwei_ . He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community."

Interesting that the proofs seem to use a limited vocabulary: https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

Why waste time say lot word when few word do trick :)

Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

  • Interesting observation. One one hand, these resemble more the notes that an actual participant would write while solving the problem. Also, less words = less noise, more focus. But also, specifically for LLMs that output one token at a time and have a limited token context, I wonder if limiting itself to semantically meaningful tokens can be create longer stretches of semantically coherent thought?

    • The original thread mentions “test-time compute scaling” so they had some architecture generating a lot of candidate ideas to evaluate. Minimizing tokens can be very meaningful from a scalability perspective alone!

      1 reply →

  • > Also worth pointing out that Alex Wei is himself a gold medalist at IOI.

    Terence Tao also called it, that the top LLMs would get gold this year in a recent podcast.

  • In transformers generating each token takes the same amount of time, regardless of how much meaning it carries. By cutting out the filler from the text, you get a huge speedup.

    • Except generating more tokens also effectively extends the computational power beyond the depth of the circuit, which is why chain of thought works in the first place. Even sampling only dummy tokens that don't convey anything still provides more computational power.

      3 replies →

I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.

  • Related — these videos give a sense of how someone might actually go about thinking through and solving these kinds of problems:

    - A 3Blue1Brown video on a particularly nice and unexpectedly difficult IMO problem (2011 IMO, Q2): https://www.youtube.com/watch?v=M64HUIJFTZM

    -- And another similar one (though technically Putnam, not IMO): https://www.youtube.com/watch?v=OkmNXy7er84

    - Timothy Gowers (Fields Medalist and IMO perfect scorer) solving this year’s IMO problems in “real time”:

    -- Q1: https://www.youtube.com/watch?v=1G1nySyVs2w

    -- Q4: https://www.youtube.com/watch?v=O-vp4zGzwIs

  • I like watching youtube videos solving these problems. They're deceptively simple. I remember reading one:

    x+y=1

    xy=1

    The incredible thing is the explanation uses almost all reasoning steps that I am familiar with from basic algebra, like factoring, quadratic formula, etc. But it just comes together so beautifully. It gives you the impression that if you thought about it long enough, surely you would have come up with the answer, which is obviously wrong, at least in my case.

    https://www.youtube.com/watch?v=csS4BjQuhCc

    • This is slightly tedious to do by hand but there isn't really anything interesting going on in that problem - it's just solving a quadratic equation over the complex numbers.

      6 replies →

  • I didn't know there were localized versions of the IMO problems. But now that I think of it, having versions of multiple languages is a must to remove the language barrier from the competitors. I guess having that many language versions (I see ~50 languages?) may make keeping the security of the problems considerably harder?

    • The problems are chosen by representatives from all the countries. So every country has someone who knows the full exam before the participants get it. Security is on the honour system, but it seems to mostly work.

    • iirc, the IMO system automatically translates the questions into 50 languages, after they are entered in English.

  • How do those compare to leetcode hard problems?

    • Depends on how hard, but the “average hard” leetcode problem is much easier. These will be more like the ACM ICPC level questions, which I’d put at the “hard hard” leetcode level (also this is a collegiate competition rather than high school, but with broader participation).

Terence Tao on the matter - https://imgur.com/a/terence-tao-on-supposed-gold-imo-sMKP0bm

  • Actual post instead of ad-decorated screnshot: https://mathstodon.xyz/@tao/114881418225852441 (thread continued in https://mathstodon.xyz/@tao/114881419368778558 and https://mathstodon.xyz/@tao/114881420636881657).

    • Fair points, but the reason everyone is amazed is that five years ago this was entirely impossible for computers irrespective of the competition format or rules.

      It’s as-if we had learned whale song, and then within two years a whale had won a Nobel prize for their research in high pressure aquatic environments. You’d similarly get naysayers debating the finer points of what special advantage whales may have in that particular field, neglecting the stunned shock of the general population — “Whales are publishing research papers now!? Award winning papers at that!?”

      4 replies →

  • It's a good point - IMO is about performance under some specific resource constraints, and those constraints don't make sense for AIs. But I wonder how far we are from an AI solving a well-studied unsolved math problem. That would be more of a decisive "quantum supremacy" type milestone.

  • > there will be a proposal at some point to actually have an AI math Olympiad where at the same time as the human contestants get the actual Olympiad problems, AI’s will also be given the same problems, the same time period and the outputs will have to be graded by the same judges, which means that it’ll have be written in natural language rather than formal language.[1]

    Last month, Tao himself said that we can compare humans and AIs at IMO. He even said such AI didn't exist yet and AIs won't beat IMO in 2025. And now that AIs can compete with humans at IMO under the same conditions that Tao mentioned, suddenly it becomes an apples-to-oranges comparison?

    [1] https://lexfridman.com/terence-tao-transcript/

  • He is basically asking OpenAI to publish their methodology so we can understand the real state of AI in solving math problems.

From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team got only 21/42 points on it. In most other teams nobody solved it.

  • In the IMO, the idea is that the first day you get P1, P2 and P3, and the second day you get P4, P5 and P6. Usually, ordered by difficulty, they are P1, P4, P2, P5, P3, P6. So, usually P1 is "easy" and P6 is very hard. At least that is the intended order, but sometime reality disagree.

    Edit: Fixed P4 -> P3. Thanks.

  • To me, this is a tell of human-involvement in the model solution.

    There is no reason why machines would do badly on exactly the problem which humans do badly as well - without humans prodding the machine towards a solution.

    Also, there is no reason why machines could not produce a partial or wrong answer to problem 6 which seems like survivor bias to me. ie, that only correct solutions were cherrypicked.

    • There is at least one reason - it was a harder problem. Agreed that which IMO problems are hard for a human IMO participant and which are hard for an LLM are different things, but seems like they should be positively correlated at least?

      1 reply →

    • While it's zero proof, since the data used for training is human generated, you raise an interesting point: the financial stakes are so high in LLM research that we should be skeptical of all frontier results.

      An internet connected machine that reasons like humans was by default considered a fraud 5 years ago; it's not unthinkable some researchers would fake it till they made it, but of course you need proof of it before making such an accusation.

    • > There is no reason why machines would do badly on exactly the problem which humans do badly as well

      Unless the machine is trained to mimic human thought process.

    • Lmao.

      You know IMO questions are not all equally difficult, right? They're specifically designed to vary in difficulty. The reason that problem 6 is hard for both humans and LLM is... it's hard! What a surprise.

    • Lol the OpenAI naysayers on this site are such conspiracy theorists.

      There are many things that are hard for AI’s for the same reason they’re hard for humans. There are subtleties in complexity that make challenging things universal.

      Obviously the model was trained on human data so its competencies lie in what other humans have provided input for over the years in mathematics, but that isn’t data contamination, that’s how all humans learn. This model, like the contestants, never saw the questions before.

These are high school level only in the sense of assumed background knowledge, they are extremely difficult.

Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.

This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.

The answers are not in the training data.

This is not a model specialized to IMO problems.

  • Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning" but I'd imagine they RL'd on olympiad math questions? If not I really hope someone from OpenAI says that bc it would be pretty astounding.

    • They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).

      8 replies →

  • From my vague rememberance of doing data science years ago, it's very hard not to leak the training set.

    Basically how you do RL is that you make a set of training examples of input-output pairs, and set aside a smaller validation set, which you never train on, to check if your model's doing well.

    What you do is you tweak the architecture and the training set until it does well on the validation set. By doing so, you inadvertedly leak info about the training set. Perhaps you choose an architecture which does well on the validation set. Perhaps you train more on examples more like ones being validated.

    Even without the explicit intent to cheat, it's very hard to avoid this contamination, if you chose a different validation set, you'd end up with a different model.

  • >> This is not a model specialized to IMO problems.

    How do you know?

    • Yeah, looking at the GP ... say a sequence of things that are true and plausible. That add your strong, unsupported claim at the end. I remember the approach from when I studied persuasion techniques...

  • > The answers are not in the training data.

    > This is not a model specialized to IMO problems.

    Any proof?

    • There's no proof that this is not made up, let alone any shred of transparency or reproducibility.

      There are trillions of dollars at stake in hyping up these products; I take everything these companies write with a cartload of salt.

    • No, and they're lying on the most important claim: that this is not a model specialized to IMO problems.

      From the thread:

      > just to be clear: the IMO gold LLM is an experimental research model.

      The thread tried to muddy the narrative by saying the methodology can generalize, but no one is claiming the actual model is a generalized model.

      There'd be a massively different conversation needed if a generalized model that could become the next iteration of ChatGPT had achieved this level of performance.

  • It almost certainly is specialized to IMO problems, look at the way it is answering the questions: https://xcancel.com/alexwei_/status/1946477742855532918

    E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig

    Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.

    • I actually think this “cheating” is fine. In fact it’s preferable. I don’t need an AI that can act as a really expensive calculator or solver. We’ve already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency. I find really no value in avoiding the optimal or near optimal techniques we’ve devised rather than focusing on the harder reasoning tasks of choosing tools, instrumenting them properly, interpreting their results, and iterating. This is the missing piece in automated reasoning after all. A NN that can approximate at great cost those tools is a parlor trick and while interesting not useful or practical. Even if they have some agent system here, it doesn’t make the achievement any less that a machine can zero shot do as well as top humans at incredibly difficult reasoning problems posed in natural language.

      3 replies →

    • Why is "almost certainly"? The link you provided has this to say:

      > 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

      1 reply →

    • Since this looks like geometric proof, I wonder if the AI operates only on logical/mathematical statements or it actually somehow 'visualizes' the proof like a human would while solving.

  • [flagged]

    • No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field). As the original parent said, pretty much only ppl who had the training in high school can. Like number theorists without training might be able to do some number theory IMO questions but this level is basically impossible without specialized training (with maybe a few exceptions of very strong mathematicians)

      7 replies →

    • I am a professor in a math department (I teach statistics but there is a good complement of actual math PhDs) and there are only about 10% who care about these types of problems and definitely less than half who could get gold on an IMO test even if they didn’t care.

      They are all outstanding mathematicians, but the IMO type questions are not something that mathematicians can universally solve without preparation.

      There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.

      6 replies →

    • Getting gold at the IMO is pretty damn hard.

      I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.

      I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.

      I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.

      7 replies →

    • IMO questions are to math as leetcode questions are to software engineering. Not necessarily easier or harder but they test ability on different axes. There’s definitely some overlap with undergrad level proof style questions but I disagree that being a working mathematician would necessarily mean you can solve these type of questions quickly. I did a PhD in pure math (and undergrad obv) and I know I’d have to spend time revising and then practicing to even begin answering most IMO questions.

Google also joined IMO, and got gold prize.

https://x.com/natolambert/status/1946569475396120653

OAI announced early, probably we will hear announcement from Google soon.

  • Google’s AlphaProof, which got a silver last year, has been using a neural symbolic approach. This gold from OpenAI was pure LLM. We’ll have to see what Google announces, but the LLM approach is interesting because it will likely generalize to all kinds of reasoning problems, not just mathematical proofs.

    • OpenAI’s systems haven’t been pure language models since the o models though, right? Their RL approach may very well still generalize, but it’s not just a big pre-trained model that is one-shotting these problems.

      The key difference is that they claim to have not used any verifiers.

      2 replies →

    • > it will likely generalize to all kinds of reasoning problems, not just mathematical proofs

      Big if true. Setting up an RL loop for training on math problems seems significantly easier than many other reasoning domains. Much easier to verify correctness of a proof than to verify correctness (what would this even mean?) for a short story.

    • I’m much more excited about the formalized approach, as LLM’s are susceptible to making things up. With formalization, we can be mathematically certain that a proof is correct. This could plausibly lead to machines surpassing humans in all areas of math. With a “pure English” approach, you still need a human to verify correctness.

    • Neither Gemini or OpenAI have open models. We don’t know for sure what’s happening underneath.

  • Given the Noam Brown comment ("It was a surprise even to many researchers at OpenAI") it seems extra surprising if multiple labs achieved this result at once.

    There's a comment on this twitter thread saying the Google model was using Lean, while IIUC the OpenAI one was pure LLM reasoning (no tools). Anyone have any corroboration?

    In a sense it's kinda irrelevant, I care much more about the concrete things AI can achieve, than the how. But at the same time it's very informative to see the limits of specific techniques expand.

Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

https://x.com/polynoamial/status/1946478249187377206

  • > it’s also more efficient [than o1 or o3] with its thinking.

    "So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of his, she never wins (outcomes: may be infinite or she may lose by sum if she picks badly; but no win). So she does NOT have winning strategy at λ=c. So at equality, neither player has winning strategy."[1]

    Why use lot word when few word do trick?

    1. https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

  • That's a big leap from "answering test questions" to "contributing to scientific discovery".

    • Having spent tens of thousands of hours contributing to scientific discovery by reading dense papers for a single piece of information, reverse engineering code written by biologists, and tweaking graphics to meet journal requirements… I can say with certainty it’s already contributing by allowing scientists to spend time on science versus spending an afternoon figuring out which undocumented argument in a R package from 2008 changes chart labels.

      4 replies →

  • Yeah that’s the dream, but same as with the bar exams, they are fine tuning the models for specific tests. Which probably the model even has been trained on previous version of those tests

  • What's the clear path to improved efficiency now that we've reached peak data?

    • > now that we've reached peak data?

      A) that's not clear

      B) now we have "reasoning" models that can be used to analyse the data, create n rollouts for each data piece, and "argue" for / against / neutral on every piece of data going into the model. Imagine having every page of a "short story book" + 10 best "how to write" books, and do n x n on them. Huge compute, but basically infinite data as well.

      We went from "a bunch of data" to "even more data" to "basically everything we got" to "ok, maybe use a previous model to sort through everything we got and only keep quality data" to "ok, maybe we can augment some data with synthetic datasets from tools etc" to "RL goes brrr" to (point B from above) "let's mix the data with quality sources on best practices".

      4 replies →

    • The thing is, people claimed already a year or two ago that we'd reached peak data and progress would stall since there was no more high-quality human-written text available. Turns out they were wrong, and if anything progress accelerated.

      The progress has come from all kinds of things. Better distillation of huge models to small ones. Tool use. Synthetic data (which is not leading to model collapse like theorized). Reinforcement learning.

      I don't know exactly where the progress over the next year will be coming from, but it seems hard to believe that we'll just suddenly hit a wall on all of these methods at the same time and discover no new techniques. If progress had slowed down over the last year the wall being near would be a reasonable hypothesis, but it hasn't.

      2 replies →

    • there is also huge realm of private/commercial data which is not absorbed by LLMs yet. I think there are way more private/commercial data than public data.

  • > I think we’re close to AI substantially contributing to scientific discovery.

    The new "Full Self-Driving next year"?

    • "AI" already contributes "substantially" to "scientific discovery". It's a very safe statement to make, whereas "full self-driving" has some concrete implications.

      2 replies →

    • As an aside, that is happening in China right now in commercial vehicles. I rode a robotaxi last month in Beijing, and those services are expanding throughout China. Really impressive.

  • How is a claim, "clear evidence" to anything?

    • I read the GP's comment as "but [assuming this claim is correct], this is clear evidence to the contrary."

    • Most evidence you have about the world is claims from other people, not direct experiment. There seems to be a thought-terminating cliche here on HN, dismissing any claim from employees of large tech companies.

      Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.

      16 replies →

  • Thing is, for example, all of classical physics can be derived from Newton's laws, Maxwell's equations and the laws of Thermodynamics, all of which can be written on a slip of paper.

    A sufficiently brilliant and determined human can invent or explain everything armed only with this knowledge.

    There's no need to train him on a huge corpus of text, like they do with ChatGPT.

    Not sure what this model's like, but I'm quite certain it's not trained on terabytes of Internet and book dumps, but rather is trained for abstract problem solving in some way, and is likely much smaller than these trillion parameter SOTA transformers, hence is much faster.

    • If you look at the history of physics I don't think it really worked like that. It took about three centuries from Newton to Maxwell because it's hard to just deduce everything from basic principles.

      3 replies →

    • And the billions of years of evolution and the language that you use to explain the task to him and and the schooling he needs to understand what you're saying it and... and and and?

Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.

  • Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breath taking science fiction a few years ago - now it’s not exciting? Regardless of the tools for verification or even solvers - why is the goal post moving so fast? There is no bonus for “purity of essence” and using only neural networks. We live in an era where it’s hard to tell if machines are thinking or not, which for since the first computing machines was seen as the ultimate achievement. Now we Pooh Pooh the results of each iteration - which unfold month over month not decade over decade now.

    You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques used. In fact - you should be excited if we’ve started to break out of the limitations of forcing NN to be load bearing in literally everything. That’s a sign of maturing technology not of limitations.

    • >> Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breath taking science fiction a few years ago - now it’s not exciting?

      Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.

      7 replies →

    • Without sharing their methodology, how can we trust the claim ? questions like:

      1) did humans formalize the input 2) did humans prompt the llm towards the solution etc..

      I am excited to hear about it, but I remain skeptical.

    • >Why is that less exciting?

      Because if I have to throw 10000 rocks to get one in the bucket, I am not as good/useful of a rock-into-bucket-thrower as someone who gets it in one shot.

      People would probably not be as excited about the prospect of employing me to throw rocks for them.

      3 replies →

    • I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.

      Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.

      This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.

      Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.

      If you're not familiar with System 1 / System 2, it's googlable .

      23 replies →

    • Because the usefulness of an AI model is reliably solving a problem, not being able to solve a problem given 10,000 tries.

      Claude Code is still only a mildly useful tool because it's horrific beyond a certain breadth of scope. If I asked it to solve the same problem 10,000 times I'm sure I'd get a great answer to significantly more difficult problems, but that doesn't help me as I'm not capable of scaling myself to checking 10,000 answers.

  • >if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

    That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.

    • The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.

      9 replies →

  • I don't think it's much less exciting if they ran it 10000 parallel? It implies an ability to discern when the proof is correct and rigorous (which o3 can't do consistently) and also means that outputting the full proof is within capabilities even if rare.

  • > if OpenAI ran this 10000 times in parallel and cherry-picked the best one

    This is almost certainly the case, remember the initial o3 ARC benchmark? I could add this is probably multi-agent system as well, so the context length restriction can be bypassed.

    Overall, AI good at math problems doesn't make news to me. It is already better than 99.99% of humans, now it is better than 99.999% of us. So ... ?

  • > what tools were used and how the model used them

    According to the twitter thread, the model was not given access to tools.

Progress is astounding. Recently report published about evaluation of LLMs on IMO 2025. o3 high didn't even get bronze.

https://matharena.ai/imo/

Waiting for Terry Tao's thoughts, but these kind of things are good use of AI. We need to make science progress faster rather than disrupting our economy without being ready.

  • [flagged]

    • I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.

      For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.

      2 replies →

    • > I assume you are aware of the standard of Olympiad problems and that they are not particularly high.

      Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.

      The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.

      You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.

    • I feel like I've noticed you you making the same comment 12 places in this thread -- incorrectly misrepresenting the difficulty of this tournament and ultimately it comes across as a bitter ex.

      Here's an example problem 5:

      Let a1,a2,…,an be distinct positive integers and let M=max⁡1≤i<j≤n.

      Find the maximum number of pairs (i,j) with 1≤i<j≤n for which (ai +aj )(aj −ai )=M.

      9 replies →

It’s interesting that this is a competition elite enough that several posters on a programming website don’t seem to understand what it is.

My very rough napkin math suggests that against the US reference class, imo gold is literally a one in a million talent (very roughly 20 people who make camp could get gold out of very roughly twenty million relevant high schoolers).

  • I’m not trying to take away from the difficulty of the competition. But I went to a relatively well regarded high school and never even heard of IMO until I met competitors during undergrad.

    I think that the number of students who are even aware of the competition is way lower than the total number of students.

    I mean, I don’t think I’d have been a great competitor even if I tried. But I’m pretty sure there are a lot of students that could do well if given the opportunity.

    • Are you in the US? Have you heard of the AMC (used to be AHMSE) and the AIME? Those are the feeders to the IMO.

      If your school had a math team and you were on it, would be surprised if you didn't hear of it

      You may not have heard of the IMO because no one in school district, possibly even state got in. It is extremely selective (like 20 students in the entire country)

      3 replies →

In the RLHF sphere you could tell some AI company/companies were targeting this because of how many IMO RLHF’ers they were hiring specifically. I don’t think it’s really easy to say how much “progress” this is given that.

  • I doubt this is coming from RLHF - tweets from the lead researcher state that this result flows from a research breakthrough which enables RLVR on less verifiable domains.

    • Math RLHF already has verifiable ground truth/right vs wrong, so I don't what this distinction really shows.

      And AI changes so quickly that there is a breakthrough every week.

      Call my cynical, but I think this is an RLHF/RLVR push in a narrow area--IMO was chosen as a target and they hired specifically to beat this "artificial" target.

      1 reply →

  • They were hiring IMO winners because IMO winners tend to be good at working on AI, not because they had the people specifically to make the AI better at math.

    • Uh no. I’m a math RLHF’er. When I get hired, I work on math/logic up to masters level because that’s my qualifications. Masters and PHD work on masters and PHD level. And IMO work on IMO math.

      Every skill and skill level is specifically assigned and hired in the RLHF world.

      Sometime the skill levels are fuzzier, but that’s usually very temporary.

      And as been said already, IMO is a specific skill that even PHD math holders aren’t universally trained for.

Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...

  • While I usually enjoy seeing these discussions, I think they are really pushing the usefulness of bayesian statistics. If one dude says the chance for an outcome is 8% and another says it's 16% and the outcome does occur, they were both pretty wrong, even though it might seem like the one who guessed a few % higher might have had a better belief system. Now if one of them had said 90% while the other said 8% or 16%, then we should pay close attention to what they are saying.

    • The person who guessed 16% would have a lower Brier score (lower is better) and someone who estimated 100%, beyond being correct, would have the lowest possible value.

      1 reply →

    • A 16% or even 8% event happening is quite common so really it tells us nothing and doesn’t mean either one was pretty wrong.

    • From a mathematical point of view there are two factors: (1) Initial prior capability of prediction from the human agents and (2) Acceleration in the predicted event. Now we examine the result under such a model and conclude that:

      The more prior predictive power of human agents imply the more a posterior acceleration of progress in LLMs (math capability).

      Here we are supposing that the increase in training data is not the main explanatory factor.

      This example is the gem of a general framework for assessing acceleration in LLM progress, and I think its application to many data points could give us valuable information.

      1 reply →

    • The whole point is to make many such predictions and experience many outcomes. The goal is for your 70% predictions to be correct 70% of the time. We all have a gap between how confident we are and how often we're correct. Calibration, which can be measured by making many predictions, is about reducing that gap.

    • If i predict that my next dice roll will be a 5 with 16% certainty and i do indeed roll a 5, was my prediction wrong?

  • Impressive prediction, especially pre-ChatGPT. Compare to Gary Marcus 3 months ago: https://garymarcus.substack.com/p/reports-of-llms-mastering-...

    We may certainly hope Eliezer's other predictions don't prove so well-calibrated.

  • Context? Who are these people and what are these numbers and why shouldn't I assume they're pulled from thin air?

    • > why shouldn't I assume they're pulled from thin air?

      You definitely should assume they are. They are rationalists, the modus operandi is to pull stuff out of thin air and slap a single digit precision percentage prediction in front to make it seems grounded in science and well thought out.

    • You should basically assume they are pulled from thin air. (Or more precisely, from the brain and world model of the people making the prediction.)

      The point of giving such estimates is mostly an exercise in getting better at understanding the world, and a way to keep yourself honest by making predictions in advance. If someone else consistently gives higher probabilities to events that ended up happening than you did, then that's an indication that there's space for you to improve your prediction ability. (The quantitative way to compare these things is to see who has lower log loss [1].)

      [1] https://en.wikipedia.org/wiki/Cross-entropy

      3 replies →

    • >Who are these people

      Clowns, mostly. Yudkowski in particular, whose only job today seems to be making awful predictions and letting lesswrong eat it up when one out of a hundred ends up coming true, solidifying his position as AI-will-destroy-the-world messiah. They make money from these outlandish takes, and more money when you keep talking about them.

      It's kind of like listening to the local drunkard at the bar that once in a while ends up predicting which team is going to win in football inbetween drunken and nonsensical rants, except that for some reason posting the predictions on the internet makes him a celebrity, instead of just a drunk curiosity.

  • One of the most worrying trends in AI has been how wrong the experts have been with overestimating timelines.

    On the other hand, I think human hubris naturally makes us dramatically overestimate how special brains are.

  • Off topic, but am I the only one getting triggered every time I see a rationalist quantify their prediction of the future with single digit accuracy? It's like their magic way of trying to get everyone to forget that they reached their conclusion in completely hand-wavy way, just like every other human being. But instead of saying "low confidence" or "high confidence" like the rest of us normies, they will tell you they think there is 16.27% chance because they really really want you to be aware that they know bayes theorem.

    • Interestingly, this is actually a question that's been looked at empirically!

      Take a look at this paper: https://scholar.harvard.edu/files/rzeckhauser/files/value_of...

      They took high-precision forecasts from a forecasting tournament and rounded them to coarser buckets (nearest 5%, nearest 10%, nearest 33%), to see if the precision was actually conveying any real information. What they found is that if you rounded the forecasts of expert forecasters, Brier scores got consistently worse, suggesting that expert forecast precision at the 5% level is still conveying useful, if noisy, information. They also found that less expert forecasters took less of a hit from rounding their forecasts, which makes sense.

      It's a really interesting paper, and they recommend that foreign policy analysts try to increase precision rather than retreating to lumpy buckets like "likely" or "unlikely".

      Based on this, it seems totally reasonable for a rationalist to make guesses with single digit precision, and I don't think it's really worth criticizing.

      16 replies →

    • Would you also get triggered if you saw people make a bet at, say, $24 : $87 odds? Would you shout: "No! That's too precise, you should bet $20 : $90!"? For that matter, should all prices in the stock market be multiples of $1, (since, after all, fluctuations of greater than $1 are very common)?

      If the variance (uncertainty) in a number is large, correct thing to do is to just also report the variance, not to round the mean to a whole number.

      Also, in log odds, the difference between 5% and 10% is about the same as the difference between 40% and 60%. So using an intermediate value like 8% is less crazy than you'd think.

      People writing comments in their own little forum where they happen not to use sig-figs to communicate uncertainty is probably not a sinister attempt to convince "everyone" that their predictions are somehow scientific. For one thing, I doubt most people are dumb enough to be convinced by that, even if it were the goal. For another, the expected audience for these comments was not "everyone", it was specifically people who are likely to interpret those probabilities in a Bayesian way (i.e. as subjective probabilities).

      7 replies →

    • If you take it with a grain of salt it's better than nothing. In life to express your opinion sometimes the best way is to quantify that based on intuition. To make decisions you could compile multiple experts intuitive quantities and use median or similar. There are some cases where it's more straight forward and rote, e.g. in military if you have to make distance based decisions, you might ask 8 of your soldiers to each name a number they think the distance is and take the median.

    • No you’re definitely not the only one… 10% is ok, 5% maybe, 1% is useless.

      And since we’re at it: why not give confidence intervals too?

    • >Off topic, but am I the only one getting triggered every time I see a rationalist

      The rest of the sentence is not necessary. No, you're not the only one.

    • You could look at 16% as roughly equivalent to a dice roll (1 in 6) or, you know, the odds you lose a round of Russian roulette. That's my charitable interpretation at least. Otherwise it does sound silly.

    • There is no honor in hiding behind euphemisms. Rationalists say ‘low confidence’ and ‘high confidence’ all the time, just not when they're making an actual bet and need to directly compare credences. And the 16.27% mockery is completely dishonest. They used less than a single significant figure.

      2 replies →

I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.

Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.

  • And yet when working on production code current LLMs are about as good as a poor intern. Not sure why the disconnect.

    • Depends. I’ve been using it for some of my workflows and I’d say it is more like a solid junior developer with weird quirks where it makes stupid mistakes and other times behaves as a 30 year SME vet.

      2 replies →

    • It’s the same reason leet code is a bad interview question. Being good at these sorts of problems doesn’t translate directly to being good at writing production code.

    • because competitive coding is narrow well described domain(limited number of concepts: lists, trees, etc) with high volume of data available for training, and easy way to setup RL feeback loop, so models can improve well in this domain, which is not true about typical enterprise overbloated software.

      3 replies →

I am neither an optimist nor a pessimist for AI. I would likely be called both by the opposite parties. But the fact that AI / LLM is still rapidly improving is impressive in itself and worth celebrating for. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.

I am just happy the prize is so big for AI that there are enough money involve to push for all the hardware advancement. Foundry, Packaging, Interconnect, Network etc, all the hardware research and tech improvements previously thought were too expensive are now in the "Shut up and take my money" scenario.

  • But unlike the trillion dollars invested in the broadband internet build out between 1998 and 2008, when this 10 year trillion dollar bubble pops, we won't be left with an enduring and useful piece of infrastructure adding a trillion dollars to the global economy annually.

    • It would leave a lots of general purpose GPU-based compute. That is useful and enduring infrastructure? These things are used for many scientific and engineering problems - including medicine, climate modeling, material science, neuroscience, etc

    • I think that "Query Engine" you can later distill is quite useful artefact. If I were to TP back in time I would take current LLM with me over wikipedia as it's more accessible

    • Holy shit, AIs just got a gold medal on the math olympiad and you guys are STILL spamming this shit in every thread. I don't even know how you can reach this level of inertia on a topic, did you short Nvidia stock or something?

    • >we won't be left with an enduring and useful piece of infrastructure adding a trillion dollars to the global economy annually.

      Nearly all colleagues I know working inside a very large non-tech organisation are using Copilot for part of their work in the past 12 months. I have never seen tech adoption this quick for normal every day consumer. Not PC, Not Internet, Not Smartphone.

      I actually had discussions with parents about our kids using ChartGPT. Every single one of them at school are using it. Honestly I didn't like it but they were actually the one who got used to it first and I quote "Who still uses Google?". That was when I learn there will be a tectonic shift in tech.

      Does it actually add productivity? may be. Is it worth the trillion dollar investment? I have no idea. But are we going back? As someone who knows a lot about consumer behaviour I will say that is a definite no.

      Note to myself. This feels another iPhone moment again. Except this time around lots of the tech people are skeptic of it, but consumer are adopting faster. When iPhone launch a lot of tech people knew it will be the future. But consumer took some time. Even MKBHD acknowledge his first Smartphone was in the iPhone 4s era.

    • > ... we won't be left with an enduring and useful piece of infrastructure adding a trillion dollars to the global economy annually.

      I'm not drinking the AGI kool-aid but I use LLMs daily. We pay not one but two AI subscriptions at home (including Claude).

      It's extremely useful. From translation to proof-reading to synthetizing to expanding on something to writing little dumb functions to helping with spreadsheet formulas to documenting code to writing commit messages to helping find movie names (when I only remember very partially the plot) etc.

      How is this not already adding a trillion dollars to the economy?

      It's not about the infrastructure: all that counts are the models. They're here to stay. They're not going away.

      It's the single biggest time-saver I've ever seen for mundane tasks (and, no, it doesn't write good code: it write shitty pathetic underperforming insecure code... And yet it's still useful for proofs of concept / one-offs / throwaway).

  • "worth celebrating for"

    The correlation between "companies make smarter AI" and "our lives get better" is still a rounding error.

    Many people will say "don't worry, tech always makes our lives better eventually", they'll probably stop saying this once autonomous killer drone-swarms are a thing.

This is such an interesting time because the percentage of people who are making predictions about AGI happening on the future are going to drop off and the number of people completely ignoring the term AGI will increase.

  • That doesn't seem likely because the LLMs haven't really delivered any great products that can cover the money spent and so AGI hype is essentially to keep the money flowing.

The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.

There’s so much to do at inference time. This result could not have been achieved without the substrate of general models. Its not like Go or protein folding. You need the collective public global knowledge of society to build on. And yes, there’s enough left for ten years of exploration.

More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.

  • Yup, we have bootstrapped to enough intelligence in the models that we can introduce higher levels of ai

The Final boss was:

   Which is greater, 9.11 or 9.9?

/s

I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.

If someone told me this say, 10 or 20 years ago, I would have assumed this was worthy of a Nobel/Turing prize ...

  • Early machine learning researchers literally got Nobel Prize last year. Clearly not every incremental step of progress merits a Nobel.

    • Yes! But there is also a delay as 10 or 20 years ago we already had neural nets and I'm curious if people back then thought the concept was Nobel worthy.

Has anyone independently reviewed these solutions?

My proving skills are extremely rusty so I can’t look at these and validate them. They certainly are not traditional proofs though.

  • I read through P1, and it seemed to be correct. Though you could explain the central idea of the proof into about 3 sentences and a few drawings.

    It reads like someone who found the correct answer but seemingly had no understanding of what they did and just handed in the draft paper.

    Which seems odd, shouldn't an LLM be better at prose?

    • One would think. I suppose OpenAI threw the majority of their compute budget at producing and verifying solutions. It would certainly be interesting to see whether or not this new model can distill its responses to just those steps necessary to convey its result to a given audience.

I get the feeling that modern computer systems are so powerful that they can solve almost all well-explored closed problems with a properly tuned model. The problem lies in efficiency, reliability, and cost. Increasing efficiency and reliability would require an exponential increase in cost. QC might solve that cost part, and symbolic reasoning model will significantly boost both efficiency and reliability.

Definitely interesting. Two thoughts. First, are the IMO questions somewhat related to other openly available questions online, making it easier for LLMs that are more efficient and better at reasoning to deduce the results from the available content?

Second, happy to test it on open math conjectures or by attempting to reprove recent math results.

  • From what I've seen, IMO question sets are very diverse. Moreover, humans also train on all available set of math olympiad questions and similar sets too. It seems fair game to have the AI train on them as well.

    For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.

  • You mean as in the previous years questions will have been used to train it? Yes, they are the same questions and due to them limited format on math questions, there are repeats so LLMs should fundamentally be able to recognise a structure and similarities and use that.

    • They are not the same question, why are you spreading so much misinformed takes in this thread? I know a guy who had one of the best scores in history at IMO and he's incredibly intelligent. Stop repeating that getting a gold medal at IMO is a piece of cake - it's not.

Pre-registering a prediction:

When (not if) AI does make a major scientific discovery, we'll hear "well it's not really thinking, it just processed all human knowledge and found patterns we missed - that's basically cheating!"

  • Turns out goalposts are the world’s most easily moved objects. We should start building spacecraft out of them.

    • I saw the phrase "goalposts aren't just moving, they're doing parkour" recently and I do love that image. It does seem to capture the state of things quite well.

  • Less that AI is cheating and more that we basically found a way to take the thousand monkeys with infinite time scenario and condense that into a reasonable(?) amount of time and with some decent starting instructions. The AI wouldn't have done any of the heavy lifting of the discovery, it just iterated on the work of past researchers at speeds beyond human.

    • Honest question - how is that not true of those past researchers?

      IE, they...

      - Start with the context window of prior researchers.

      - Set a goal or research direction.

      - Engage in chain of thought with occasional reality-testing.

      - Generate an output artifact, reviewable by those with appropriate expertise, to allow consensus reality to accept or reject their work.

      2 replies →

    • It sounds like you're saying AI is just doing brute force with a lot of force, but I can't imagine that's actually what you think, so would you mind clarifying?

  • If you want credit for getting predictions right, you have to predict something that has less than 100% probability to happen.

  • I think both can be true - I'm pretty sure a lot of what it is viewed as genius insight by the public, is actually researchers being really familiar with the state of the art in their field and putting in the legwork of trying new ideas.

  • People get very fragile when AI is better at something than them (excluding speed/scale of operations, where computers have an obvious edge)

There is some relevant context from Terence Tao on Mathstodon:

> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance gives the tool, and how one reports their results.

> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.

> The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".

> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:

> * One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)

> * Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.

> * The team leader gives the students unlimited access to calculators, computer algebra packages, textbooks, or the ability to search the internet.

> * The team leader has the six-student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.

> * The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.

> * Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.

> * If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.

> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.

> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.

Source:

https://mathstodon.xyz/@tao/114881418225852441

https://mathstodon.xyz/@tao/114881419368778558

https://mathstodon.xyz/@tao/114881420636881657

I am quite surprised that DeepMind, with MCTS, wasn't able to reach this level of math performance itself.

  • Google will also have good results to report for this year's IMO; OpenAI just beat them to the announcement.

    • I think Google did some official collaboration with the IMO, and will announce later. Or at least that's what I read from the IMO official saying "AI companies should wait 1 week before announcing so that we can celebrate the human winners" and "to my knowledge OpenAI was not officially collaborating with IMO" ...

Makes sense. Mathematicians use intuition a lot to drive their solution seeking, and I suppose an AI such as an LLM could develop intuition too. Of course, where AI really wins is search speed and the fact that an LLM really doesn't get tired when exploring different strategies and steps within each strategy.

However, I expect that geometric intuition may still be lacking, mostly because of the difficulty of encoding it in a form which an LLM can easily work with. After all, ChatGPT still can't draw a unicorn [1], although it seems to be getting closer.

[1] https://gpt-unicorn.adamkdean.co.uk/

OpenAI simply can’t be trusted on any benchmarks: https://news.ycombinator.com/item?id=42761648

  • Remember that they've fired all whistleblowers that would admit to breaking the verbal agreement that they wouldn't train on the test data.

  • Somewhat related, but I’ve been feeling as of late what can best be described as “benchmark fatigue”.

    The latest models can score something like 70% on SWE-bench verified and yet it’s difficult to say what tangible impact this has on actual software development. Likewise, they absolutely crush humans at sport programming but are unreliable software engineers on their own.

    What does it really mean that an LLM got gold on this year’s IMO? What if it means pretty much nothing at all besides the simple fact that this LLM is very, very good at IMO style problems?

    • As far as I can tell, the actual advancement here is in the methodology used to create a model tuned for this problem domain, and in how efficient that method is. Theoretically then, it makes it easier to build other problem-domain-specific models.

      That a highly tuned model designed to solve IMO problems can solve IMO problems is impressive, maybe, but yeah it doesn't really signal any specific utility otherwise.

  • I don't fault you for maintaining a healthy scepticism, but per the President of the IMO: "It is very exciting to see progress in the mathematical capabilities of AI models, but we would like to be clear that the IMO cannot validate the methods, including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced. What we can say is that correct mathematical proofs, whether produced by the brightest students or AI models, are valid." [1]

    The proofs are correct, and it's very unlikely that IMO problems were leaked ahead of time. So the options for cheating in this circumstance are that a) IMO are colluding with a few researchers at OpenAI for some reason, or b) @alexwei_ solved the problems himself - both seem pretty unlikely to me.

    [1] https://imo2025.au/wp-content/uploads/2025/07/IMO-2025_Closi...

  • In OpenAI's own released papers, they show Anthropic's models performing better than their own. They tend to be pretty transparent and honest in their benchmarks.

    The thing is, only leading AI companies and big tech have the money to fund these big benchmarks and run inference on them. As long as the benchmarks are somewhat publicly available and vetted by reputable scientists/mathematicians it seems reasonable to believe they're trustworthy.

  • Not to beat a dead horse or get into a debate, but to hopefully clarify the record:

    - OpenAI denied training on FrontierMath, FrontierMath-derived data, or data targeting FrontierMath specifically

    - The training data for o3 was frozen before OpenAI even downloaded FrontierMath

    - The final o3 model was selected before OpenAI looked at o3's FrontierMath results

    Primary source: https://x.com/__nmca__/status/1882563755806281986

    You can of course accuse OpenAI of lying or being fraudulent, and if that's how you feel there's probably not much I can say to change your mind. One piece of evidence against this is that the primary source linked above no longer works at OpenAI, and hasn't chosen to blow the whistle on the supposed fraud. I work at OpenAI myself, training reasoning models and running evals, and I can vouch that I have no knowledge or hint of any cheating; if I did, I'd probably quit on the spot and absolutely wouldn't be writing this comment.

    Totally fine not to take every company's word at face value, but imo this would be a weird conspiracy for OpenAI, with very high costs on reputation and morale.

I just tried this: take the graphs of the functions x^n and exp(x); how many points of intersection do they have?

ChatGPT gave me the wrong answer: it claimed 2 points of intersection, but for n=4 there are 3, as one can easily derive: one for negative x and 2 points for positive x, because exp(x) eventually grows faster than x^4.

Then I corrected it and said 3 points of intersection. It said yes and gave me the 3 points. Then I said no, there are 4 points of intersection, and it again explained to me that there are 2 points of intersection, which is wrong.

Then I asked it how many points of intersection there are for n=e, and it said: zero.

Well, exp(x)=x^e for x=e, isn't it?
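
A quick numerical sanity check backs up the n=4 claim. This is just a minimal Python sketch that counts sign changes of x^4 - exp(x) on a grid (a heuristic root count, not a proof):

    import math

    def count_sign_changes(f, lo, hi, steps=200000):
        # Each sign change of f on the grid brackets at least one root.
        count, prev = 0, f(lo)
        for i in range(1, steps + 1):
            cur = f(lo + (hi - lo) * i / steps)
            if prev * cur < 0:
                count += 1
            prev = cur
        return count

    print(count_sign_changes(lambda x: x**4 - math.exp(x), -5, 20))
    # -> 3: one root near x ~ -0.8 and two positive roots near x ~ 1.4 and x ~ 8.6

    print(math.e**math.e - math.exp(math.e))
    # -> ~0, i.e. x = e is indeed an intersection point for the n = e case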

While this is nice for splashy headlines, I would rather see headlines about real-life use cases of math grads using AI as a companion tool for solving novel problems.

> Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

GPT-5 finally on the horizon!

"The LLM system's core mechanism is probably a "propose-verify" loop that operates on a vocabulary of special tokens representing formal logic expressions. At inference time, it first proposes a new logical step by generating a sequence of these tokens into its context window, which serves as a computational workspace. It then performs a subsequent computational pass to verify if this new expression is a sound deduction from the preceding steps. This iterative cycle, learned from a vast corpus of synthetic proof traces, allows the model to construct a complete, validated formal argument. This process results in a system with abstract reasoning capabilities and functional soundness across domains that depend on reasoning, achieved at the cost of computation required for its extended inference time."

The world is changing and it’s exciting. Either you’re on or you’re off. The world doesn’t wait.

Guys, that's nothing. My new AI system is not LLM-based but neuro-symbolic and yet it just scored 100% on the IMO 2026 problems that haven't even been written yet, it is that good.

What? This is a claim with all the trustworthiness of OpenAI's claim. I mean, I can claim anything I want at this point and it would still be just as trustworthy as OpenAI's claim, with exactly zero details about anything other than "we did it, promise".

I tried P1 on chatgpt-o4-high; it tells me the solution is k=0 or 1. It doesn't even know that k=3 is a solution for n=3. Such a solution would get 0/7 in the actual IMO.

It’s interesting how hard and widespread a push they’re making in advertising this - at this particular moment, when there are rumors of more high level recruitment attempts / successes by Zuck. OpenAI is certainly a master at trying to drive narratives. (Independent of the actual significance / advance here). Sorry, there are too many hundreds of billions of dollars involved to not be a bit cautious and wary of claims being pushed this hard.

> level performance on the world’s most prestigious math competition

I don't know which one I would consider the most prestigious math competition, but it wouldn't be the IMO. The Putnam ranks higher to me, and I'm not even an American. But I've come to realise one thing, and that is that high school is very important to Americans...

  • The Putnam and IMO are quite different. I would suggest the IMO is probably harder...

    • I would disagree; the IMO depends only on late middle school/early high school level mathematics (geometry, gcd, functions) while Putnam typically depends on late high school/early college-level mathematics (integrals, limits, matrices).

      1 reply →

My issue with all these citations is that it’s all OpenAI employees that make these claims.

I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.

  • A third party tried this experiment with publicly available models. OpenAI did half as well as Gemini, and none of the models even got bronze.

    https://matharena.ai/imo/

    • I feel you're misunderstanding something. That's not "this exact experiment". Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model, on that problems set.

      It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.

      8 replies →

Also interesting takeaways from that tweet chain:

>GPT5 soon

>it will not be as good as this secret(?) model

My view is that it's less impressive than the previous Go and chess results. Humans are worse at competitive math than at those games, and it's still a very limited space with well-defined problems. They may hype "general purpose" as much as they want, but for now it's still the case that AI is superhuman at well-defined, limited-space tasks and can't match the performance of a mediocre, below-average human at simple tasks without those limitations, like driving a car.

Nice result, but it's just another game humans got beaten at. This time a game which isn't even taken very seriously (in comparison to ones that have a professional scene).

  • The scope and creativity required for the IMO are much greater than for chess/Go. Also, the IMO is taken VERY seriously. It's a huge deal, much bigger than any chess or Go tournament.

    • In my opinion, competitive math (or programming) is about knowing some tricks and then trying to find a combination of them that works for a given task. The number of tricks and the depth required are much less than in Go or chess.

      I don't think it's a very creative endeavor in comparison to chess/Go. The searching required is less as well. There is a challenge in processing natural language and producing solutions in it, though.

      The creativity required is not even a small fraction of what is required for scientific breakthroughs. After all, no task that you can solve in 30 minutes or so can possibly require that much creativity - just knowledge and a fast mind, things computers are amazing at.

      I am an AI enthusiast. I just think a lot of the things that were done so far are more impressive than being good at competitive math. It's a nice result, blown out of proportion by OpenAI employees.

      4 replies →

  • The significance, though, is that the "very limited space and well defined problems" continue to expand. Moving from a purpose-built system for playing a single game to a system that can address a broader set of problems would still be a significant step, as more high-value tasks will fall into its competency range. It seems the next big step will be on us to improve eval/feedback systems in less defined problems.

It's a level playing field, IMO. But there's another thread which claims not even bronze, and I really don't want to go to X for anything.

  • I can save you the click. Public models (Gemini/o3) score less than bronze. This is a specially trained model which is not publicly available.

The issue is that trust is very hard to build and very easy to lose. Even in today's age where regular humans have a memory span shorter than that of an LLM, OpenAI keeps abusing the public's trust. As a result, I take their word on AI/LLMs about as seriously as I'd take my grocery store clerk's opinion on quantum physics.

  • I still haven’t forgotten OpenAI’s FrontierMath debacle from December. If they really have some amazing math-solving model, give us more info than a vague twitter hype-post.

  • > The issue is that trust is very hard to build and very easy to lose

    I think it's the opposite: the general public blindly trusts all kinds of hyped stuff; it's a very few hyper-skeptics, some fraction of a percent of the population, who don't.

  • Especially since they are saying they don't plan to release this kind of model anytime soon.

I like how they always say AI will advance science when they want to sell it to the public, but pump how it will replace workers when selling it to businesses. It’s like dangling a carrot while slowly putting a knife to our throats.

Edit: why was my comment moved from the one I was replying to? It makes no sense here on its own.

I don't know how much novelty you should expect from the IMO every year, but I expect many of the problems to be variations of the same problem.

These models are trained on all the old problems and their various solutions. For LLMs, solving these problems is about as impressive as writing code.

There is no high generalization.

  • You should expect quite a bit of novelty from the IMO, given the constraint of high school level curriculum. The problem setters work very hard to avoid problems that are variations of other contests or solvable by routine methods. That's why this is a very exciting result--you can't just regurgitate homework problem solutions to get a high score at the IMO.

The 4.5-hour time limit is a bit of an odd metric, given that OpenAI effectively has an unlimited amount of compute at their disposal. If they're running the model on 100,000 GPUs for 4.5 hrs, that's obviously going to have better outcomes than running it on 5.
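
As a toy model of why that matters (made-up numbers, not OpenAI's actual setup): if each independent attempt solves a problem with probability p, then at least one of n parallel attempts within the same wall-clock window succeeds with probability 1 - (1 - p)^n, which climbs quickly with n.

    # Toy best-of-n model: probability that at least one of n independent attempts succeeds.
    p = 0.05  # assumed per-attempt success rate (made up for illustration)
    for n in (1, 10, 100, 1000):
        print(n, round(1 - (1 - p) ** n, 4))
    # -> 1: 0.05, 10: 0.4013, 100: 0.9941, 1000: 1.0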

This is awesome progress in human achievement, getting these machines to be intelligent. And it is also a fast regression and decline in human wisdom!

We are simply greasing the grooves, letting things slide faster and faster, and calling it progress. How does this help make the integration of humans and nature better?

Does this improve climate or make humans adapt better to changing climate? Are the intelligent machines a burning need for the humanity today? Or is it all about business and political dominance? At what cost? What's the fall out of all this?

  • Nobody knows the answers to these questions. Relying on AGI solving problems like climate change seems like a risky strategy but on the other hand it’s very plausible that these tools can help in some capacity. So we have to build, study and find out but also consider any opportunity cost of building these tools versus others.

    • Solving climate change isn't a technical problem, but a human one. We know the steps we have to take, and have for many years. The hard part is getting people to actually do them.

      No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.

      3 replies →

>AI model performs astounding feat everyone claimed was impossible or won’t be achieved for a while

>Commenters on HN claim it must not be that hard, or OpenAI is lying, or cheated. Anything but admit that it is impressive

Every time on this site lol. A lot of people here have an emotional aversion to accepting AI progress. They’re deep in the bargaining/anger/denial phase.

  • I’ve been thinking a lot about what AI means about being human. Not about apocalypses or sentience or taking our jobs, but about “what is a human” and “what is the value of a human”.

    All my life I’ve taken for granted that your value is related to your positive impact, and that the unique value of humans is that we can create and express things like no other species we’ve encountered.

    Now, we have created this thing that has ripped away many of my preconceptions.

    If an AI can adequately do whatever a particular person does, then is there still a purpose for that person? What can they contribute? (No I am not proposing or even considering doing anything about it).

    It just makes me sad, like something special is going away from the world.

    • The fact that you're honestly grappling with this reality puts you far ahead of most people.

      It seems a common recent neurosis (albeit a protective one) to proclaim a permanent human preeminence over the world of value, moral status and such for reasons extremely coupled with our intelligence, and then claim that certain kinds of intelligence have nothing to do with it when our primacy in those specific realms of intelligence is threatened. This will continue until there's nothing humans have left to bargain with.

      The world isn't what we want it to be; the world is what it is. The closest thing we have to the world turning out the way we want is making it that way. Which is why I think many of those who hate AI would give their desires for how the world ought to be a better fighting chance by putting in the work to make it so, rather than sitting in denial of what is happening in the world of artificial intelligence.

      1 reply →

  • Sort of a naive response, considering many of the folks calling out the issues have significant experience building with LLMs or building LLMs.

    • Denying the rapid improvement in AI is the only naivety that really matters in the long run at this point. I haven’t seen much substantive criticism of this achievement that boils down to anything more than “well it’s a secret model so they must not be telling us something”

    • I'm building with LLMs, and they're solving problems that weren't possible to solve before due to how many resources they would consume. Resources, as in human-hours.

      Finance, chemistry, biology, medicine.

  • A problem as old as the field. Quote from the 80s:

    > There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

  • I guess my major question would be: does the training data include anything from 2025 which may have included information about the IMO 2025?

    Given that AI companies are constantly trying to slurp up any and all data online, if the model was derived from existing work, it's maybe less impressive than at first glance. If a present-day model does well at IMO 2026, that would be nice.

  • Get ready to be downvoted; the wave of the single-minded is coming.

    • There is a diversity of opinions on this site. I do hope that soon more of the intelligent commenters who have spent a while denying AI progress will realize what’s actually happening and contribute their brainpower to a meaningful cause in the lead up to AI-human parity. If we want a good future in a world where AI is smarter than humans, we need to do alignment work soon.

The cynicism/denial on HN about AI is exhausting. Half the comments are some weird form of explaining away the ever-increasing performance of these models.

I've been reading this website for probably 15 years, and it's never been this bad. Many threads are completely unreadable; all the actual educated takes are on X. It's almost like there was a talent drain.

  • Cynicism and denial are two very different things, and have very different causes and justifications. I personally don't deny that LLMs are very powerful and capable of eliminating many jobs. At the same time, I'm very cynical about the rollout and push for AI. I don't see it in any way as a push for a "better" society or towards some notion of progress, but rather as an enthusiastic effort to disempower employees, centralize power, expand surveillance, increase profits, etc.

    • AI is kerosene. A useful resource when applied with reason and compassion. Late-stage capitalism is a dumpster full of garbage. AI in combination with late-stage capitalism is a dumpster fire. Many, perhaps most, people conflate the dumpster fire with "kerosene evil!"

      2 replies →

  • Making an account just to point out how these comments are far more exhausting, because they don't engage with the subject matter. They are just agreeing with a headline and saying, "See?"

    You say, "explaining away the increasing performance" as though that was a good faith representation of arguments made against LLMs, or even this specific article. Questionong the self-congragulatory nature of these businesses is perfectly reasonable.

  • Probably because both sides have strong vested interests and it’s next to impossible to find a dispassionate point of view.

    The pro-AI crowd, VCs, tech CEOs, etc. have a strong incentive to claim humans are obsolete. Many tech employees see threats to their jobs and want to pooh-pooh any way AI could be useful or competitive.

    • That's a huge hyperbole. I can assure you many people find the entire thing genuinely fascinating, without having any vested interest and without buying the hype.

      1 reply →

    • That's just another way to state that everybody is almost always self-serving when it comes to anything.

    • Or some can spot a euphoric bubble when they see it, with lots of participants who have over-invested in the 90% of these so-called AI startups that are not frontier labs.

      5 replies →

  • Accepting OpenAI at face value is just the lazy stance.

    Finding a critical perspective and trying to understand why it could be wrong is more fun. You just say "I was wrong" when proved wrong.

  • Two things can happen at the same time: Genuine technological progress and the “hype machine” going into absolute overdrive.

    The problem with the hype machine is that it provokes an opposite reaction and the noise from it buries any reasonable / technical discussion.

  • > I've been reading this website for probably 15 years, its never been this bad.

    People here were pretty skeptical about AlexNet, when it won the ImageNet challenge 13 years ago.

    https://news.ycombinator.com/item?id=4611830

  • Indeed it's a very unsophisticated and obnoxious audience around here. They are so conservative and unadventurous that anything possibly transformative is labeled as hype. I feel bad for them.

  • Enthusiastically denouncing or promoting something is much, much easier and more rewarding in the short term for people who want to appear hip to their chosen in-group - or profit center.

    And then, it's likewise easy to be a reactionary to the extremes of the other side.

    The middle is a harder, more interesting place to be, and people who end up there aren't usually chasing money or power, but some approximation of the truth.

  • I agree that there's both cynicism and denial, but when I've explained my views I have usually been able to get through to the complainers.

    Usually my go-to example for LLMs doing more than mass memorization is Charton's and Lample's LLM trained on function expressions and their derivatives, which is able to go from the derivatives back to the original functions and thus perform integration. But at the same time I know that LLMs are essentially completely crazy, with no understanding of reality: just ask them to write some fiction and you'll have the model outputting discussions where characters who have never met before are addressing each other by name, or getting other similarly basic things wrong, and when something genuinely is not in the model you will end up in hallucination land. So the people saying that the models are bad are not completely crazy.

    With the wrong codebase I wouldn't be surprised if you need a finetune.
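
    For context, the core trick in that Lample-Charton work, as I understand it, is generating synthetic training pairs by running differentiation backwards: sample a random expression F, differentiate it symbolically, and train a seq2seq model to map F' back to F. A rough sketch of the data generation with sympy (the real pipeline differs in the details):

        import random
        import sympy as sp

        x = sp.symbols('x')
        ATOMS = [x, x**2, sp.sin(x), sp.cos(x), sp.exp(x)]

        def random_expr(depth=2):
            # Build a small random expression tree from the atoms above.
            if depth == 0:
                return random.choice(ATOMS)
            op = random.choice([sp.Add, sp.Mul])
            return op(random_expr(depth - 1), random_expr(depth - 1))

        # "Backward" data generation: pick F at random, differentiate it, and the
        # training pair is (F', F), i.e. the model learns to integrate F'.
        pairs = [(sp.diff(F, x), F) for F in (random_expr() for _ in range(5))]
        for deriv, antideriv in pairs:
            print(f"integrate({deriv})  ->  {antideriv}")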

  • It's caught in a kind of feedback loop. There are only so many times you can see "stochastic parrot" or "fancy autocomplete" or "can't draw hands" or "just a bunch of matmuls, it can't replicate the human soul" lines before you decide to just not engage. This leads to more of the content being exactly that, driving more people away.

    At this point, there are much better places to find technical discussion of AI, pros and cons. Even Reddit.

    • Yeah, a lot of the time now I will draft a comment and then not even publish it. Like, what's the point?

  • This sounds like a version of "HN hates X and I am tired of it". In the last 10 years or so that I have been reading HN, X has been crypto, Musk/Tesla and many more.

    So, as much as I get the frustration, comments like these don't really add much. It's complaining about others complaining. Instead, this should be taken as a signal that maybe HN is not the right forum to read about these topics.

    • GP is exaggerating, but this thread in particular is really bad.

      It's healthy to be skeptical, and it's even healthier to be skeptical of OpenAI, but there are commenters here who clearly have no idea what IMO problems are, saying that this somehow means nothing.

  • It's obvious why, though. The typical "tech" culture values human ingenuity, creativity, intelligence and agency due to its history. Someone coming up with a new algorithm in their garage can build a billion-dollar business - it is an indie hacker culture that historically valued "human intelligence".

    i.e. it is a culture of meritocracy, where no matter your social connections, political or financial capital, if you are smart and driven you can make it.

    AI flips that around. It devalues human intelligence and moves the moats back to the ol' school things of money, influence and power. The big winners are no longer the hardest working or the most intelligent. Intelligence is devalued; as a wealthy person I now have intelligence at my fingertips, making it a commodity rather than a virtue - but money, power and connections, those are now the moat.

    If all you have is your talent, the future could look quite scary in an AI world long term. Money buys the best models; connections, wealth and power become the remaining moats. This doesn't typically gel with the "indie hacker" culture of most tech forums.

  • Makes sense. Everyone here has their pride and identity tied to their ability to code. HN likes to upvote articles related to IQ because coding correlates with IQ and HNers like to think they are smart.

    AI is of course a direct attack on the average HNers identity. The response you see is like attacking a Christian on his religion.

    The pattern of defense is typical. When someone’s identity gets attacked they need to defend their identity. But their defense also needs to seem rational to themselves. So they begin scaffolding a construct of arguments that in the end support their identity. They take the worst aspects of AI and form a thesis around it. And that becomes the basis of sort of building a moat around their old identity as an elite programmer genius.

    A telltale sign that you or someone else is doing this is when you are talking about AI and someone just comments about how they aren't afraid of AI taking over their own job, when that wasn't even directly the topic.

    If you say something like "AI is going to lessen the demand for software engineering jobs", the typical thing you hear is "I'm not afraid of losing my job", and I'm like: bro, I'm not talking about your job specifically. I'm not talking about you or your fear of losing a job; I'm just talking about the economics of the job market. This is how you know it's an identity thing more than a technical topic.

  • Based on the past history with FrontierMath & AIME 2025 [1],[2], I would not trust announcements which can't be independently verified. I am excited to try it out, though.

    Also, the performance of LLMs on IMO 2025 was not even bronze [3].

    Finally, this article shows that LLMs were mostly just bluffing [4] on USAMO 2025.

    [1] https://www.reddit.com/r/slatestarcodex/comments/1i53ih7/fro...

    [2] https://x.com/DimitrisPapail/status/1888325914603516214

    [3] https://matharena.ai/imo/

    [4] https://arxiv.org/pdf/2503.21934

  • Basically this. Not sure why people here love to doubt AI progress as it clearly makes strides

    • Because, per the corps' statements, AI is now top-0.1%-of-PhDs level in math, coding, physics, law, medicine, etc., yet when I try it myself for my work it makes stupid mistakes. So I suspect the corps are very pushy about manipulating metrics/benchmarks.

    • I don't doubt the genuine progress in the field (from like, a research perspective) but my experience with commercial LLM products comes absolutely nowhere close to the hype.

      It's reasonable to be suspicious of self aggrandizing claims from giant companies hyping a product, and it's hard not to be cynical when every forced AI interaction (be it Google search or my corporate managers or whatever) makes my day worse.

  • HN feels very low signal, since it's populated by people who barely interact with the real world

    X is higher signal, but very group thinky. It's great if you want to know the trends, but gotta be careful not to jump off the cliff with the lemmings.

    Highest signal is obviously non digital. Going to meetups, coffee/beers with friends, working with your hands, etc.

    • It used to be high signal, though. You have to wonder if the type of people posting on here is different than it used to be.

  • Meh. Some overhype, some underhype. People like you whine and then don't want to listen to any technical concerns.

    Some of us are implementing things in relation to AI, so we know it's not about "increasing performance of models" but actually about the right solution for the right problem.

    If you think Twitter has "educated takes", then maybe go there and stop being a pretentious schmuck over here.

    Talent drain, lol. I'd much rather have skeptics and good tips than usernames, follows and social media engagement.

    • Both sides are not equally wrong, clearly. Until yesterday prediction markets were saying the probability of an AI getting a gold medal in IMO in 2025 was <20%. So clearly we should be more hyped, not less.

      1 reply →

  • It may be a talent drain too, but at least it's a selection bias. People just get enough and go away, or don't comment. At the extreme, that leads to a downward spiral in the epistemology of the site. Look at how AI fares in Bluesky.

    As a partially separate issue, there are people trying to punish comments quoting AI by downvotes. You don't need to have a non-informative reply, just sourcing it to AI is enough. A random internet dude telling the same thing with less justification or detail is fine to them.

  • It's because hackers are fed up with being conned by corporations that steal our code, ideas, and data. They start out "open" only to rug-pull. "Pissing in the pool of open source."

    As hackers we have more responsibility than the general public because we understand the tech and its side effects, we are the first line of defense so it is important to speak out not only to be on the right side of history but also to protect society.

  • It’s the same with anything related to cryptocurrency. HN has a hate boner for certain topics.

  • The overconfidence/short sightedness on HN about AI is exhausting. Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

    • > Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

      I do not see that at all in this comment section.

      There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.

      4 replies →

    • I went through the thread and saw nothing that looked like this.

      I don’t think developers will be obsolete in five years. I don’t think AGI is around the corner. But I do think this is the biggest breakthrough in computer science history.

      I worked on accelerating DNNs a little less than a decade ago and had you shown me what we’re seeing now with LLMs I’d say it was closer to 50 years out than 20 years out.

      9 replies →

    • I don’t typically find this to be true. There is a definite cynicism on HN especially when it comes to OpenAI. You already know what you will see. Low quality garbage of “I remember when OpenAI was open”, “remember when they used to publish research”, “sama cannot be trusted”, it’s an endless barrage of garbage.

      3 replies →

    • Nobody likes the idea that this is only "economically superior AI". Not as good as humans, but a LOT cheaper.

      The "it will just get better" line is bubble bait for the investors. The tech companies learned from the past, and they are riding and managing the bubble to extract maximum ROI before it pops.

      The reality is that a lot of the work done by humans can be replaced by an LLM with lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.

      The current model of LLMs are enshitification accelerators and that will have real effects.

  • > I've been reading this website for probably 15 years, its never been this bad... all the actual educated takes are on X

    Almost every technical comment on HN is wrong (see for example essentially all the discussion of Rust async, in which people keep making up silly claims that Rust maintainers then attempt to patiently explain are wrong).

    The idea that the "educated" takes are on X though... that's crazy talk.

    • With regard to AI & LLMs, Twitter/X is actually the only place where all of the industry people are discussing.

      There are a bunch of great accounts to follow that are only really posting content to X.

      Karpathy, nearcyan, kalomaze, all of the OpenAI researchers (including the author of the post this discussion is about), many Anthropic researchers. It's such a meme that you see people discuss reading the Twitter thread + the paper, because the thread gives useful additional context.

      HN still has great comment sections on maker-style posts and on networking stuff, but I no longer enjoy the discussions about AI here. It's too hyperbolic.

      7 replies →

    • This is true of every forum and every topic. When you actually know something about the topic you realize 90% of the takes about it are garbage.

      But in most other sites the statistic is 99%, so HN is still doing much better than average.

    • Not on AI; here it is really a fringe environment of relatively uninformed commenters, compared to X. X has its craziness, but you can curate your feeds by using lists. Here I can't choose whom to follow.

      And like said, the researchers themselves are on X, even Gary Marcus is there. ;)

  • Software that mangles data on the regular should be thrown away.

    How is it rational to 10x the budget over and over again when it mangles data every time?

    The mind blowing thing is not being skeptical of that approach, it's defending it. It has become an article of faith.

    It would be great to have AI chatbots. But chatbots that mangle data getting their budgets increased by orders of magnitude over and over again is just doubling down on the same mistake over and over again.

  • HN doesn't have a strong enough protection against bots, so foreign influence campaign bots with the goal of spreading negative sentiment about American technology companies are, I believe, very common here.

And of course it's available even in Icelandic, spoken by ~300k people, but not a single Indian language, spoken by hundreds of millions.

भारत दुर्दशा न देखी जाई... ("India's sorry plight cannot be borne to see...")

  • Presumably almost all competitors from India would be fluent in English (given it is the second most spoken language there)? I guess the same is true of Icelandic though.

    • Yes, and there's also languages of ex-USSR countries, whose competitors presumably all understand Russian, and so on.

      The real reason might be that there's an enormous class of self-loathing elites in India who actively despise the possibility of any Indian language being represented in higher education. This obviously stunts the possibility of them being used in international competitions.

      8 replies →

I believe this company used to present its results and approach in academic papers with enough details so that it could be reproduced by third parties.

Now it is just doing a bunch of tweets?

  • That's when they were a real research company. The last proper research they did was InstructGPT; everything since has been product development and following others. The reputation hit hasn't caught up with them because Sam Altman has built a whole career out of outrunning the reputation lag.

Am I missing something or is this completely meaningless? It's 100% opaque, no details whatsoever and no transparency or reproducibility.

I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.

[dead]

  • > such announcements should wait at least a week after the closing ceremony

    It would raise more concerns that corps had leaked questions/answers into training data and fine-tuned specialized models during that time.

[flagged]

  • > I think OpenAI participating is nothing but a publicity stunt and wholly unfair and disrespectful against Human participants. It should be allowed for AI models to participate, but it should not be ranked equally,

    OpenAI did not participate in the actual competition, nor were they taking spots away from humans. OpenAI just gave the problems to their AI under the same time limit and conditions (no external tool use).

    > nor put any engineers under duress of having to pull all-nighters.

    Under duress? At a company like this, all of the people working on this project are there because they want to be and they’re compensated millions.

  • As far as I can tell, OpenAI didn't participate, and isn't claiming they participated. Note the fairly precise phrasing of "gold medal-level performance": they claim to have shown performance sufficient for a gold, not that they won one.

    • > they claim to have shown performance sufficient for a gold

      This sounds very like Ferrari claiming that their cars can drive fast enough to get gold in the Olympic games 100 meter sprint.

      1 reply →

  • - AI competing is "wholly unfair"

    - "[AI is] far away from being substantially being better than MCTs"

    ^ pick only one

    • Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).

      Whether or not they’re far away from being better than humans is up to debate, but the entire point of these types of benchmarks it to compare them to humans.

      1 reply →

[flagged]

  • > high school/early university maths problems should not have been a stretch at all for it.

    Either you are unfamiliar with the International Math Olympiad or you’re trying to be misleading.

    Calling these problems high school/early university maths is a ridiculous characterization.

huh?

any details?

  • [flagged]

    • Which would be impressive if we knew those problems weren't in the training data already.

      I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.

      But we must be wary of mixing up smart information retrieval with reasoning.

      3 replies →

99.99+% of all problems humans face do not require particularly original solutions. Determining whether LLMs can solve truly original (or at least obscure) problems is interesting, and a problem worth solving, but ignores the vast majority of the (near-term at least) impact they will have.

Not even bronze.

https://news.ycombinator.com/item?id=44615695

  • From the article:

    > Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

  • This is a new model. Those tests were with what's currently publicly available.

    • Ah true.

      Although it's the benchmark that is publicly available. The model is not.

I fed the Problem 1 solution into Gemini and asked if it was generated by a human or an LLM. It said:

Conclusion: It is overwhelmingly likely that this document was generated by a human.

----

Self-Correction/Refinement and Explicit Goals:

"Exactly forbidden directions. Good." - This self-affirmation is very human.

"Need contradiction for n>=4." - Clearly stating the goal of a sub-proof.

"So far." - A common human colloquialism in working through a problem.

"Exactly lemma. Good." - Another self-affirmation.

"So main task now: compute K_3. And also show 0,1,3 achievable all n. Then done." - This is a meta-level summary of the remaining work, typical of human problem-solving "

----