Comment by InkCanon
9 months ago
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.
Yes, here's the link: https://arxiv.org/abs/2503.21934v1
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PHD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waiving their BG-GPT on national TV in the country as a major breakthrough, while it was basically was a Mistral fine-tuned model, that was eventually never released to the public, nor the training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on linkedin by these guys, literally celebrating every PHD they get, which is at minimum very weird. I'd take anything they release with a grain of sand if not caution.
Half the researchers are at ETH Zurich (INSAIT is a partnership between EPFL, ETH and Sofia) - hardly an unreliable institution.
[dead]
In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.
While I may be mistaken, but I don't believe that LLMs are trained on a large corpus of machine readable music representations, which would arguably be crucial to strong performance in common practice music theory. I would also surmise that most music theory related datasets largely arrive without musical representations altogether. A similar problem exists for many other fields, particularly mathematics, but it is much more profitable to invest the effort to span such representation gaps for them. I would not gauge LLM generality on music theory performance, when its niche representations are likely unavailable in training and it is widely perceived as having miniscule economic value.
music theory is a really good test because in my experience the AI is extremely bad at it
> In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.
This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.
LLMs excel at what might be described as interpolating context following input and output in natural language. As in a chatbot that is extensivey trained in domain-specific tasks, which can also parse and generate content. There is absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.
1 reply →
Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
16 replies →
> This effectively makes LLMs useless for education.
No. You're only arguing LLMs are useless at regurgitating homework assignments to allow students to avoid doing it.
The point of education is not mindless doing homework.
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages; 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1) 2) and 3) were out by 1,1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
This little experiment made my skeptical about the state of the art of AI. I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
8 replies →
To be fair, I love that magicians can pull tricks on me even though I know it is fake.
2.5 pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs in bullshit as evidenced by a weak model no one thinks is close to SOTA.
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
"That's impossible because..."
"Good point! Blah blah blah..."
Absolutely shameless!
I just asked my company-approved AI chatbot the same question.
It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowlesge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if might not matter in the context of filling an air plane cabin with golf balls, it does not inspire trust for more serious questions.
> It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).
In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
2 replies →
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.
Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.
Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5Pro, to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
4 replies →
I have a strong suspicion that for all the low threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not i get something shaked out of the sleeve of a bottom barrel model.
Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.
Its most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesent missunderstands the question, it tries to squeeze the input to something close enough?
It'll get it right next time because they'll hoover up the parent post.
This reminds me of Google quick answers we had for a time in search. It is quite funny if you live outside the US, because it very often got the units or numbers wrong because of different decimal delimiters.
No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8 sqare miles country seriously?
I've seen humans make exactly these sorts of mistakes?
As another commenter mentioned, LLMs tend to make these bad mistakes with enormous confidence. And because they represent SOTA technology (and can at times deliver incredible results), they have extra credence.
More than even filling the gaps in knowledge / skills, would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.
A lot of humans are similarly good at some stuff and bad at other things.
Looking up the math ability of the average American this is given as an example for the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
I had to look up these acronyms:
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not a LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of LLM with lots of bells and whistles, just like AlphaFold is.
o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.
https://openai.com/index/learning-to-reason-with-llms/
The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
AIME is so not IMO.
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be honing in on both the issue and the actual value of LLM:s. I think there's a lot of value in a "language calculator" but if it's continuously being sold as something it's not we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.
True. There’s a small bonus that trying to explain the issue to the llm may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the llm can give erroneous output that still might trigger some thinking on a different direction, and sometimes I’m inclined to think it’s helping me more than it actually is.
I was working some time ago on image processing model using GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but requires a lot extra efforts to make it work. Unstable and falls apart (blows up to unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing models' architectures and sizes. Adjusting some coefficients through the training to gradually rebalance loss functions' influence.
The same may work with you problem. If it's unstable try introduce extra 'brakes' which theoretically are not required. May be even incorrect. Whatever it is in your domain. Another thing to check is optimizer, try several. Check default parameters. I've heard Adams defaults lead to instability later in training.
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
Yes, I suspect that engineering the loss and hyperparams could eventually get this to work. However, I was hoping the model would help me get to a more fundamental insight into why the training falls into bad minima. Like the Wasserstein GAN is a principled change to the GAN that improves stability, not just fiddling around with Adam’s beta parameter.
The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.
And yes the fact I don’t have to look at matplotlib documentation anymore makes these models extremely useful already, but thats qualitatively different from having Putnam prize winning reasoning ability
1 reply →
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
What do you find inferior in 3.7 compared to 3.5 btw? I only recently started using Claude so I don't have a point of reference.
3 replies →
I doubt this is because his explanation is better. I tried to ask question of Calculus I, ChatGPT just repeated content from textbooks. It is useful, but people should remind that where the limitation is.
Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.
>I'm incredibly surprised no one mentions this
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of answer being slipped to them.
And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.
And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
> within a week
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
13 replies →
New models suddenly doing much better isn't really surprising, especially for this sort of test: going from 98% accuracy to 99% accuracy can easily be the difference between having 1 fatal reasoning error and having 0 fatal reasoning errors on a problem with 50 reasoning steps, and a proof with 0 fatal reasoning errors gets ~full credit whereas a proof with 1 fatal reasoning error gets ~no credit.
And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
They are trained on some mix with minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. Just content will cost millions of $$ to generate. Distillation on new level looks like logical next step.
Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
This is simply using LLMs directly. Google has demonstrated that this is not the way to go when it comes to solving math problems. AlphaProof, which used AlphaZero code, got a silver medal in last year's IMO. It also didn't use any human proofs(!), only theorem statements in lean, without their corresponding proofs [1].
[1] https://www.youtube.com/watch?v=zzXyPGEtseI
Query: Could you explain the terminology to people who don't follow this that closely?
Not the OP but
USAMO : USA Math Olympiad. Referred here https://arxiv.org/pdf/2503.21934v1
IMO : International Math Olympiad
SOTA : State of the Art
OP is probably referring to this referred to this paper here https://arxiv.org/pdf/2503.21934v1. The paper explains out how a rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped about).
OpenAI told how they removed it for GPT-4 in its release paper: only exact string matches. So all discussion of bar exam questions from memory on test taking forums etc., that wouldnn't exactly match, made it in.
This seems fairly obvious at this point. If they were actually reasoning at all they'd be capable (even if not good) of complex games like chess
Instead they're barely able to eek out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort
Ick, OK, ACK.
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as a AI assistant, versus the actual algorithm building the play-story which humans imagine the character from. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
LLMs are capable of playing chess and 3.5 turbo instruct does so quite well (for a human) at 1800 ELO. Does this mean they can truly reason now ?
https://github.com/adamkarvonen/chess_gpt_eval
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
5 replies →
3.5 turbo instruct is a huge outlier.
https://news.ycombinator.com/item?id=42138289
11 replies →
Eek! You mean eke.
Because of the vast number of problems reused, removing those data from training sets will just make models worse. Why would anyone do it?
That type of news might make investors worry / scared.
Is that really so surprising given what we know about how these models actually work? I feel vindicated on behalf of myself and all the other commenters who have been mercilessly downvoted over the past three years for pointing out the obvious fact that next token prediction != reasoning.
2.5 pro scores 25%.
It’s just a much harder math benchmark which will fall by the end of next year just like all the others. You won’t be vindicated.
Bold claim! Let's see what that 25% is. I guarantee it is the portion of the exam which is trivially answerable if you have a stored database of all previous math exams ever written to consult.
2 replies →
Less than 5%. OpenAI's O1 burned through over $100 in tokens during the test as well!
What would the average human score be?
I.e. if you randomly sampled N humans to take those tests.
The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.
I am hesitant to correct a math Olympian, but don't you mean the median?
1 reply →
Average, hmmm?
This is a disappointing answer from an MO alum. Pick a quantile, any quantile...