I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.
> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?
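For what it's worth, the arithmetic example is a one-liner to check, which is exactly why it makes a clean test case:

```python
# Sanity check of the product quoted above
print(167_383 * 426_397)  # 71371609051
```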
I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:
1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.
2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.
However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:
1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the Math Olympiad.
2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:
I studied philosophy focusing on the analytic school and proto-computer science. LLMs are going to force many people to start getting a better understanding of what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.
Math is a perfect field for machine learning to thrive in because, theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there is some room to catch up on previously published experimental outcomes.
Having a focus in philosophy of language is something I genuinely never thought would be useful. It’s really been helpful with LLMs, but probably not in the way most people think. I’d say that folks curious should all be reading Quine, Wittgenstein’s investigations, and probably Austin.
There are ways to go beyond the human-quality data limitation. AI can be trained on data of better than average human quality, because many problems have solutions that are easy to verify. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at it.
It's also possible that there can be emergent capabilities. Perhaps a little obtuse, but you can say that humans are trained on human-quality data too and yet brilliant scientists and creative minds can rise above the rest of us.
> Their capabilities should saturate at human or maybe above-average human performance
LLMs do have superhuman reasoning speed and superhuman dedication. Speed is something you can scale, and at some point quantity can turn into quality. Much of the frontier work done by humans is just dedication, luck, and remixing other people's ideas ("standing on the shoulders of giants"), isn't it? All of this is exactly what you can scale by having restless hordes of fast-thinking agents, even if each of those agents is intellectually "just above average human".
> 1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.
Why oh why is this such a commonly held belief. RL in verifiable domains being the way around this is the entire point. It’s the same idea behind a system like AlphaGo — human data is used only to get to a starting point for RL. RL will then take you to superhuman performance. I’m so confused why people miss this. The burden of proof is on people who claim that we will hit some sort of performance wall because I know of absolutely zero mechanisms for this to happen in verifiable domains.
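The verifiable-reward loop is simple enough to caricature in a few lines. Here's a toy sketch in which a trivial "policy" (one weight per candidate answer) learns the right answer purely from pass/fail grades, with no human labels; everything here (the candidate pool, the grader, the multiplicative update) is an illustrative stand-in for the real machinery:

```python
import random

def train(question, verify, answers, steps=2000, seed=0):
    # One weight per candidate answer; sampling is proportional to weight.
    rng = random.Random(seed)
    weights = {a: 1.0 for a in answers}
    for _ in range(steps):
        total = sum(weights.values())
        pick = rng.choices(list(weights), [w / total for w in weights.values()])[0]
        if verify(question, pick):   # automatic grader, e.g. running tests
            weights[pick] *= 1.1     # reinforce whatever verified
    return max(weights, key=weights.get)

# The grader, not a human, is the teacher:
best = train("2+2", lambda q, a: eval(q) == a, answers=[3, 4, 5])
```

The human data only matters for seeding the starting point (here, the candidate pool); the verifier does the rest, which is why there's no obvious ceiling at human level.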
The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.
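A minimal sketch of that "write down intermediate findings" harness, assuming nothing about any particular agent product (`call_model` stands in for whatever LLM API is in use; the file is the memory, the model stays frozen):

```python
from pathlib import Path

NOTES = Path("findings.md")

def run_step(task, call_model):
    # Prior findings are re-read on every call, so the agent "remembers"
    # across steps even though the model's weights never change.
    context = NOTES.read_text() if NOTES.exists() else ""
    reply = call_model(
        f"Notes so far:\n{context}\n\nTask: {task}\n"
        "End with a line 'NOTE: ...' recording anything worth keeping."
    )
    # Append any flagged findings to the notes file for future steps.
    for line in reply.splitlines():
        if line.startswith("NOTE:"):
            with NOTES.open("a") as f:
                f.write(line[len("NOTE:"):].strip() + "\n")
    return reply
```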
LLMs can generate anything by design. LLMs can't understand what they are generating, so it may be true, it may be wrong, it may be novel or it may be a known thing. It doesn't discern between them, just looks for the best statistical fit.
The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving unsolved math problem" a certain meaning in our heads. Some of us, at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug, etc. Something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables in a drawer, or a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. These are formally novel things, but we really never needed any of them, and so relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for the solutions to many such problems, but they are not that impressive in the first place.
> It doesn't discern between them, just looks for the best statistical fit
Of course at the lowest level, LLMs are trained on next-token prediction, and on the surface, that looks like a statistics problem. But this is an incredibly reductionist viewpoint and I don't see how it makes any empirically testable predictions about their limits. LLMs 'learned' a lot of math and science in this way.
> "truly novel" and "solving unsolved math problem"
OK again if novelty lies on a continuum, where do you draw the line? And why is it correct to draw it there and not somewhere else? It seems like you are just naming exceptionally hard research problems.
If LLMs can come up with formally truly novel solutions to things, and you have a verification loop to ensure that they are actual proper solutions, I don't understand why you think they could never come up with solutions to impressive problems, especially considering the thread we are literally on right now? That seems like a pure assertion at this point that they will always be limited to coming up with truly novel solutions to uninteresting problems.
Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a python script can so that's what any tool-using LLM will (and should) do.
I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.
Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.
I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.
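The casting-out-nines check mentioned above is itself a nice example of cheap self-verification, worth spelling out:

```python
def mod9(n):
    # Summing decimal digits preserves the value mod 9 (10 ≡ 1 mod 9),
    # which is all "casting out nines" relies on; n % 9 computes the
    # same residue directly.
    return n % 9

a, b = 167_383, 426_397
claimed = 71_371_609_051

# The check can't prove the product right, but a mismatch proves it wrong:
consistent = (mod9(a) * mod9(b)) % 9 == mod9(claimed)
print(consistent)  # True
```

It's a one-in-nine chance of a false pass, but as a fast sanity check on long arithmetic it's exactly the kind of verification loop the models are being prompted into.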
I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).
I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.
It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.
This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.
Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot of stuff that's novel to me would be something someone has been doing for decades somewhere.
>>AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
I always found this argument very weak. There isn't that much truly new anyway. Creativity is often about mixing old ideas. Computers can do that faster than humans if they have a good framework.
Especially with something as simple as math - limited set of formal rules and easy to verify results - I find a belief computers won't beat humans at it to be very naive.
> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.
Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is because it saw many multiplication problems in its training data, and it was cheaper to learn the compressed/abstract multiplication strategies and encode them as circuits in the network, rather than memorize the times tables up to some large N. This gives it the ability to approximate multiplication problems it hasn't seen before.
> This gives it the ability to approximate multiplication problems it hasn't seen before.
More than approximate. It straight up knows the algorithms and will do arbitrarily long multiplications correctly. (Within reason. Obviously it couldn't do a multiplication so large the reasoning tokens would exceed its context window.)
Having ChatGPT 5.4 do 1566168165163321561 * 115616131811365737 without tools, after multiplying out a lot of coefficients, it eventually answered 181074305022287409585376614708755457, which is correct.
At this point, it's less misleading to say it knows the algorithm.
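The grade-school algorithm the models narrate is mechanical enough to fit in a few lines, which is part of why learning it as a circuit beats memorizing tables. A sketch over digit strings:

```python
def long_multiply(a: str, b: str) -> str:
    # Grade-school long multiplication: accumulate digit-by-digit partial
    # products per decimal position, then propagate carries once.
    da = [int(d) for d in reversed(a)]
    db = [int(d) for d in reversed(b)]
    out = [0] * (len(da) + len(db))
    for i, x in enumerate(da):
        for j, y in enumerate(db):
            out[i + j] += x * y
    carry = 0
    for k in range(len(out)):
        total = out[k] + carry
        out[k] = total % 10
        carry = total // 10
    while len(out) > 1 and out[-1] == 0:
        out.pop()  # strip leading zeros
    return "".join(map(str, reversed(out)))

print(long_multiply("12", "34"))  # 408
```

The model executing this in reasoning tokens is doing the same bounded, mechanical procedure, just with the scratchpad in text instead of a list.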
Claude, OpenAI, etc.'s AIs are not just LLMs. If you ask it to multiply something, it's going to call a math library. Go feed it a thousand arithmetic problems and it'll get them 100% right.
The major AIs are a lot more than just LLMs. They have access to all sorts of systems they can call on. They can write code and execute it to get answers. Etc.
My take as well. Furthermore, most innovations come relatively shortly after their technological prerequisites have been met, so that suggests the "novelty space" that humans generally explore is a relatively narrow band around the current frontier. Just as humans can search through this space, so too should machines be capable of it. It's not an infinitely unbounded search which humans are guided through by some manner of mystic soul or other supernatural forces.
I'm very happy to say calculators are far better than me at calculations (to a given precision). I'm happy to admit computers are so much better than me in so many aspects. And I have no problem saying LLMs are very helpful tools able to generate output so much better than mine in almost every field of knowledge.
Yet, whenever I ask it to do something novel or creative, it falls very short. But humans are ingenious beasts and I'm sure sooner or later they will design an architecture able to be creative - I just doubt it will be Transformer-based, given the results so far.
I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.
It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.
A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:
1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).
2. Provide evidence that supports your theory and ideally can not be just as well accounted for another theory.
I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.
I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can reduce to statistical token generation; it just strikes me as unintuitive, and so I'm going to need to hear something to compel me.
That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.
As for evidence that they can do novel things, there is plenty:
1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.
2. SVGs of pelicans riding bicycles
3. People use LLMs to write new apps/code every day
4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available
5. LLMs have solved open problems in physics and mathematics [0,1]
That is as far as they have advanced so far. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).
The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?
Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
Most created things are remixes of existing things.
Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.
When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well defined problem and they just loop through it.
The major AIs have access to all sorts of tools, including a math library. I thought this was well-known. There's no "illusion of actual insight" - they're just "using a calculator" (in the sense that they call a math library when needed). AIs are not just LLMs.
> You need to say why it can do some novel tasks but could never do others.
This is actually quite a tall order. Reasoning about AI and making sense of what the LLMs are doing, and learning to think about it as technology, is a very difficult and very tricky problem.
You get into all kinds of weird things about a person’s outlook on life: personal philosophy, understanding of ontology and cosmology, and then whatever other headcanon they happen to be carrying around about how they think life works.
I know that might sound kind of poetic, but I really believe it’s true.
I am a great fan of Dr Richard Hamming and he gave a wonderful series of lectures on the topic. The book Learning to Learn has the full set of his lectures transcribed (highly recommend this book!).
I guess when it can't be tripped up by simple things like multiplying numbers, counting to 100 sequentially or counting letters in a string without writing a python program, then I might believe it.
Also no matter how many math problems it solves it still gets lost in a codebase
LLMs are bad at arithmetic and counting by design. It's an intentional tradeoff that makes them better at language and reasoning tasks.
If anybody really wanted a model that could multiply and count letters in words, they could just train one with a tokenizer and training data suited to those tasks. And the model would then be able to count letters, but it would be bad at things like translation and programming - the stuff people actually use LLMs for. So, people train with a tokenizer and training data suited to those tasks, hence LLMs are good at language and bad at arithmetic.
Arguments like "but AI cannot reliably multiply numbers" fundamentally misunderstand how AI works. AI cannot do basic math not because AI is stupid, but because basic math is an inherently difficult task for otherwise smart AI. Lots of human adults can do complex abstract thinking but when you ask them to count it's "one... two... three... five... wait I got lost".
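The tokenizer point is easy to illustrate with a toy. This is not a real BPE vocabulary, just greedy longest-match over a made-up one, but it shows how frequent digit runs can become single opaque tokens:

```python
# Made-up vocabulary in which whole numbers are single tokens, roughly
# analogous to BPE merging frequent digit runs in real tokenizers.
vocab = {"123": 0, "456": 1, " * ": 2}

def tokenize(text):
    # Greedy longest-match, a crude stand-in for BPE merging.
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unseen characters pass through
            i += 1
    return tokens

print(tokenize("123 * 456"))  # ['123', ' * ', '456']
```

A model multiplying "123 * 456" sees three opaque symbols; the digits inside them have to be inferred from training, not read off, which is why per-character tasks are disproportionately hard.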
Ok, I'll bite. Show me an LLM that comes up with a new math operator. Or one which will come up with the theory of relativity if only Newtonian physics is in its training dataset. That it could remix existing ideas in ways that lead to novel insights is expected; however, the current LLMs can't come up with paradigm shifts that require novel insights. Even humans have a rather limited window in which they can come up with novel insights (when they are young, capable of latent thinking, not yet ossified by the existing formalization of science, and their brain is still energetically capable, without the vascular and mitochondrial dysfunction common as we age).
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Last I checked humans didn't pop into existence doing that. It happened after billions of years of brute-force, trial-and-error evolution. So well done for falling into the exact same trap the OP cautions against. Intelligence from scratch requires a mind-boggling amount of resources, and humans were no different.
We have a tremendous amount of raw information flowing through our brains 24/7 from before we are born, from the external world through all our senses and from within our minds as it attempts to make sense of that information, make predictions, generally reason about our existence, hallucinate alternative realities, etc. etc.
If you were able to somehow capture all that information in full detail as you've had access to by the age of say 25, it would likely dwarf the amount of information in millions of books by several orders of magnitude.
When you are 25 years old and are presented with a strange-looking ball and told to throw it into a strange-looking basket for the first time, you are relying on an unfathomable amount of information turned into knowledge, and countless prior experiments that you've accumulated/exercised to that point, relating to the way your body and the world work.
20 watts ignores the startup cost: Tens of millions of calories. Hundreds of thousands of gallons of water. Substantial resources from at least one other human for several years.
Just an interesting thought experiment: if you took all the sensory information that a child experiences through their senses (sight, hearing, smell, touch, taste) between, say, birth and age five, how many books worth of data would that be? I asked Claude, and their estimate was about 200 million books. Maybe that number is off by an order of magnitude in either direction. ...but then again Claude is only three years old, not five.
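A back-of-envelope version of that estimate, considering vision alone (every constant here is a rough guess, not a measurement):

```python
# Fermi estimate of five years of visual input, measured in "books".
bits_per_second = 10_000_000        # ~10 Mbit/s, an oft-cited retina estimate
seconds = 5 * 365 * 24 * 3600       # five years; waking/sleeping ignored
total_bits = bits_per_second * seconds

bits_per_book = 500_000 * 8         # ~500 KB of text per book
print(total_bits // bits_per_book)  # on the order of 400 million books
```

That lands in the same hundreds-of-millions ballpark, so the "200 million books" figure is at least internally plausible under these assumptions.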
To be fair, the knowledge embedded in an LLM is also, at this point, a couple orders of magnitude (at least) larger than what the average human being can retain. So it's not like all those books and text in the internet are used just to bring them to our level, they go way beyond.
It's only because humans came up with a problem, worked with the AI and verified the result that this achievement means anything at all. An AI "checking its own work" is practically irrelevant when they all seem to go back and forth on whether you need the car at the carwash to wash the car. Undoubtedly people have been passing this set of problems to AIs for months or years and have gotten back either incorrect results or results they didn't understand, but either way, a human confirmation is required. AI hasn't presented any novel problems, other than the multitudes of social problems described elsewhere. AI doesn't pursue its own goals and wouldn't know whether they've "actually been achieved".
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference, and so far we have a couple of minor (trivial?) math solutions. I'm sure if someone had bothered funding a few PhDs for a year we could have found this without AI.
The only things moving faster than AI are the goalposts in conversations like this. Now we're at "sure, AI can solve novel problems, but it can't come up with the problems themselves on its own!"
I'm curious to see what the next goalpost position is.
Funding a few PhDs for a year costs orders of magnitude more than it did to solve this problem in inference costs. Also, this has been active research for some time. Or I guess the people working on it are just not as good as a random bunch of students? It's amazing the lengths that people go to maintain their worldview, even if it means belittling hardworking people.
I take it you're not a mathematician. This is an achievement, regardless of whether you like LLMs or not, so let's not belittle the people working on these kinds of problems please.
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of time goes into understanding this, we have a young but rigorous field of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/agent is doing what humans do.
>But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
You can think whatever you want, but an untestable distinction is an imaginary one.
No, but it does mean that you should know we don't understand what intelligence is, and that maybe LLMs are actually intelligent and humans have the appearance of intelligence, for all we know.
It doesn't. I actually completely reject that theory, and it's nice to see that Wikipedia notes that it is "controversial". There are extremely good reasons to reject this theory. For one thing, any quantum effects are going to be quite tiny/trivial because the brain is too large, hot, wet, etc, to see larger effects, so you have to somehow make a leap from "tiny effects that last for no time at all" to "this matters fundamentally in some massive way".
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably for the far more important reason, it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
It also would do nothing to indicate that human intelligence is unique.
Every living thing on Earth is unique. Every rock is unique in virtually infinite ways from the next otherwise identical rock.
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
Math and coding competition problems are easier to train because of strict rules and cheap verification.
But once you go beyond that to less defined things such as code quality, where even humans have a hard time putting down concrete axioms, they start to hallucinate more and become less useful.
We are missing the value function that allowed AlphaGo to go from mid range player trained on human moves to superhuman by playing itself.
As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
This is not formally verified math, so there is no real verifiable-feedback aspect here. The best models for formalized math are still specialized ones, although general-purpose models can assist formalization somewhat.
Maybe to get a real breakthrough we have to make programming languages / tools better suited to LLM strengths, and not fuss so much about making them write code we like. What we need is correct code, not nice-looking code.
> But once you go beyond that to less defined things such as code quality
I think they have a good optimization target with SWE-Bench-CI.
You are tested on continuous changes to a repository, spanning multiple years in the original repository. Cumulative edits need to be kept maintainable and composable.
If there is something missing from the definition of "can be maintained for multiple years incorporating bugfixes and feature additions" for code quality, then more work is needed, but I think it's a good starting point.
Except it's not how this specific instance works. In this case the problem isn't written in a formal language and the AI's solution is not something one can automatically verify.
I mean, even if the technology stopped improving immediately and forever (which is unlikely), LLMs are already better than most humans at most tasks.
Including code quality. Not because they are exceptionally good (you are right that they aren’t superhuman like AlphaGo) but because most humans are rather not that good at it anyway, and also somehow "hallucinate" because of tiredness.
Even today’s models are far from being exploited to their full potential, because we have developed pretty much no tools around them except tooling to generate code.
I’m also a long-time "doubter", but as a curious person I used the tool anyway, with all its flaws, over the last 3 years. And I’m forced to admit that hallucinations are pretty rare nowadays. Errors still happen but they are very rare, and it’s easier than ever to get it back on track.
I think I’m also a "believer" now and, believe me, I really don’t want to be, because as much as I’m excited by this, I’m also pretty frightened of all the bad things that this tech could do to the world in the wrong hands, and I don’t feel like it’s particularly in the right hands.
The point is that from now on, there will be nothing really new, nothing really original, nothing really exciting. Just an endless stream of re-hashed old stuff that is just okayish.
Like an AI Spotify playlist, it will keep you in chains (aka engaged) without actually making you really happy or good. It would be like living in a virtual world, but without anything nice about living in such a world.
We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other.
AI can both explore new things and exploit existing things. Nothing forces it to only rehash old stuff.
>without actually making you like really happy or good.
What are you basing this off of? I've shared several AI songs with people in real life due to how much I've enjoyed them. I don't see why an AI playlist couldn't be good or make people happy. It just needs to find what you like in music. Again, coming back to explore vs. exploit.
Is it because the AI is trained on existing data? But we are also trained on existing data. Do you think there's something that makes the human brain special (other than the hundreds of thousands of years of evolution, but that's exactly what AI is trying to emulate)?
This may sound hostile (sorry for my lower than average writing skills), but trust me, I'm really trying to understand.
> We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other.
AI is a remixer; it remixes all known ideas together. It won't come up with new ideas though; the LLMs just predict the most likely next token based on the context. That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
But human researchers are also remixers. Copying something I commented below:
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
> AI is a remixer; it remixes all known ideas together.
I've heard this tired old take before. It's the same kind of simplistic opinion as "AI can't write a symphony". It's a logical fallacy that relies on moving goalposts to positions so impossible that people lose perspective of what your average, or even extremely talented, individual can actually do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving, one that for most would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what scientists do. So why do people win a prize when they do it, but when a computer does the same, its role is downplayed as that of a glorified card shuffler?
Turning a hard problem into a series of problems we know how to solve is a huge part of problem solving and absolutely does result in novel research findings all the time.
Standard problem × 5 + standard solutions + standard techniques for decomposing hard problems = new hard problem solved
There is so much left in the world that hasn't had anyone apply this approach, purely because no research programme has decided that it's worth their attention.
If you want to shift the bar for "original" beyond problems that can be abstracted into other problems, then you're expecting AI to do more than human researchers do.
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It output a group of characters that I can virtually guarantee you it has never seen before on its own.
remixing ideas that already exist is a major part of where innovation and breakthroughs come from. if you look at bitcoin as an example, hashes (and hashcash) and digital signatures existed for decades before bitcoin was invented. the cypherpunks also spent decades trying to create a decentralized digital currency, to the point where many of them gave up and moved on. eventually one person just took all of the pieces that already existed and put them together in the correct way. i don't see any reason why a sufficiently capable llm couldn't do this kind of innovation.
Yeah but you're thinking of AI as like a person in a lab doing creative stuff. It is used by scientists/researchers as a tool *because* it is a good remixer.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
I mean it's not going to invent new words no, but it can figure out new sentences or paragraphs, even ones it hasn't seen before, if it's highly likely based on its training and context. Those new sentences and paragraphs may describe new ideas, though!
I'm curious as to why you consider this as the benchmark for AI capabilities. Extremely few humans can solve hard problems or do much innovation. The vast majority of knowledge work requires neither of these, and AI has been excelling at that kind of work for a while now.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
Fair point, however I am simply more interested in how AI can advance frontiers than in how it can transcribe a meeting and give a summary or even print out React code. I know the world is heavily in need of the menial labor and AI already has made that stuff way easier and cheaper.
However I'm just very interested in innovation and pushing the boundaries as a more powerful force for change. One project I've been super interested in for a while is the Mill CPU architecture. While they haven't (yet) made a real chip to buy, the ideas they have are just super awesome and innovative in a lot of areas involving instruction density & decoding, pipelining, and trying to make CPU cores take 10% of the power. I hope the Mill project comes to fruition, and I hope other people build on it, and I hope that at some point AI could be a tool that prints out innovative ideas that took the Mill folks years to come up with.
most issues at every scale of community and time are political, how do you imagine AI will make that better, not worse?
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life saving treatment for a child; how much your country should compel fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
I agree with what you're saying, and I certainly don't think the one problem facing my country or the world is just that we didn't solve the right math problem yet. I am saddened by the direction the world keeps moving.
When I wrote that I hope we use it for good things, I was just putting a hopeful thought out there, not necessarily trying to make realistic predictions. It's more than likely people will do bad things with AI. But it's actually not set in stone yet, it's not guaranteed that it has to go one way. I'm hopeful it works out.
Perhaps I should have elaborated more but what I mean is I used to think, "I genuinely don't see the point in even trying to use AI for things I'm trying to solve". Ironically though, I think that because I've repeatedly tried and tested AI and it falls flat on its face over and over. However, this article makes me more hopeful that AI actually could be getting smarter.
I remember there was a conversation between two super-duper VCs (don't remember who, but famous ones) about how DeepSeek was a super-genius-level model because it solved an intro-level (like week 1-2) electrodynamics problem stated in a very convoluted way.
While cool and impressive for an LLM, I think they oversold the feat by quite a bit.
I don't want to belittle the performance of this model, but I would like for someone with domain expertise (and no dog in the AI race, like a random math PhD) to come forward and explain exactly what the problem was, and how the model contributed to the solution.
> I really hope we use this intelligence resource to make the world better.
I wish I had your optimism. I'm not an AI doubter (I can see it works all by myself, so I don't think I need such verification). But I do doubt humanity's ability to use these tools for good. The potential for power and wealth concentration is off the scale compared to most of our other inventions so far.
The problem is that the AI industry has been caught lying about their accomplishments and cheating on tests so much that I can't actually trust them when they say they achieved a result. They have burned all credibility in their pursuit of hype.
I'm all for skeptical inquiry, but "burning all credibility" is an overreaction. We are definitely seeing very unexpected levels of performance in frontier models.
I honestly do think I'm being honest with myself. I have held it in my mind that I'm not impressed until it's innovative. That threshold seems to be getting crossed.
I'm not saying, "I used to be an atheist, but then I realized that doesn't explain anything! So glad I'm not as dumb now!"
If LLMs really solved hard problems by 'trying every single solution until one works', we'd be sitting here waiting until kingdom come for there to be any significant result at all. Instead, this is just one of a few results that have cropped up in recent months, and likely a foretaste of many to come.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
We start writing all those formulas etc., and if at some point we realise we went the wrong way, we start from the beginning (or from some point we are sure about).
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
I wonder what was in that solutions file they provided. According to the prompt it’s a solution template but I want to know the contents.
Another thing I want to know is how the user keeps updating the LLM with the token usage. I didn’t know they could process additional context midtask like that.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
If only 5-10 people have ever tried to solve something in programming, every LLM will start regurgitating your own decade-old attempt again and again, sometimes even with the exact comments you wrote back then (good to know it trained on my GitHub repos...), but you can spend upwards of 100 million tokens in gemini-cli or claude code and still not make any progress.
It's after all still a remix machine; it can only interpolate between things that already exist. Which is good for a lot of tasks, considering everything is a remix, but it can't do truly new ones.
You're glossing over the fact that mathematics uses only one token per variable (`x = ...`), whereas software engineering best practices demand an excessive number of tokens per variable for clarity.
It's also a pretty silly thing to equate difficulty with token count. We all know line counts don't tell you much, and it shows in their own example.
Even if you did have math-like tokenisation, refactoring a thousand lines of "X=..." to "Y=..." isn't a difficult problem, even though it spans at least a thousand tokens. And coming up with E=mc^2 in a thousand tokens wouldn't make the two tasks remotely comparable in difficulty.
> I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is (...)
The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.
You have LLMs quite capable of accomplishing complex software engineering work that struggle with translating valid text from English to some other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.
I think it's more of a data vs intelligence thing.
They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).
Certain problems are in between the two (a React refactor probably sits there). So no, tokens are probably not a good proxy for complexity; data-heavy problems will trivially outgrow the former category.
I don't think so. I went through the output of Opus 4.6 vs. GPT 5.4 Pro. Both were given different directions/prompts. Opus 4.6 was asked to test and verify many things; it tried many different approaches, and its chain of thought was more interesting to me.
You might be joking, but you're probably also not that far off from reality.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems are opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
>The details about human involvement are always hazy and the significance of the problems are opaque to most.
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
The capabilities of AI are determined by the cost function it's trained on.
That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes imply something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.
Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).
The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up a RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.
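To make the "RLVR game" idea concrete, here is a toy sketch. All names are my own illustrative assumptions, not any lab's actual training code; the only point is that reward comes from an automatic checker, not human judgment:

```python
# Toy sketch of a verifiable-reward ("RLVR") setup. For math/code,
# correctness can be checked mechanically, so a reward signal exists
# without any human in the loop.

def verifier(problem, answer):
    # Mechanical correctness check (here: integer multiplication).
    return answer == problem["a"] * problem["b"]

def sample_policy(problem, error):
    # Stand-in for a model's sampled answer; `error` models imperfection.
    return problem["a"] * problem["b"] + error

def batch_reward(problems, errors):
    # The training signal: fraction of answers the verifier accepts.
    checks = [verifier(p, sample_policy(p, e)) for p, e in zip(problems, errors)]
    return sum(checks) / len(checks)
```

Social or open-ended tasks lack such a mechanical `verifier`, which is exactly the asymmetry at issue.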
However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or includes genuine Knightian uncertainty where a reward function can not be built without actually operating independently in the world.
To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.
> But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
Can you give some examples?
It's not obvious that there's anything that can't be written as an optimization problem.
Even advanced generalizations such as complex numbers can be said, at the time of their invention, to optimize something, e.g. the number of mathematical symbols you need to carry out certain proofs.
I think you're misreading me. My point isn't that you can't in principle state the optimization problem, but that it's much easier in some domains than in others, that this tracks with how AI has been progressing, and that progress in one area doesn't automatically mean progress in another, because current AI cost functions are less general than the cost functions that humans are working with in the world.
I am thinking there’s a large category of problems that can be solved by resampling existing proofs.
It’s the kind of brute force expedition machine can attempt relentlessly where humans would go mad trying.
It probably doesn’t really advance the field, but it can turn conjectures into theorems.
I wonder if teaching an LLM how to write Prolog and then letting it write it could be a great way to explore spaces like this in the future.
I only ever learned it in school, but if memory serves, Prolog is a whole "given these rules, find the truth" sort of language, which aligns well with these sorts of problem spaces. Mix and match enough, especially across disparate domains, and you might get some really interesting things derived and discovered that are low-hanging fruit just waiting to be discovered.
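That "given these rules, find the truth" style can be sketched (in Python rather than Prolog, as a rough illustration; the rules and facts are invented for the example) with a toy forward-chaining loop:

```python
# Toy forward-chaining inference in the "given these rules, derive the
# facts" style that Prolog embodies. Real Prolog would use unification
# and backward chaining; this just shows the flavor.

def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts

rules = [
    ({"rain"}, "wet_ground"),
    ({"wet_ground", "cold"}, "ice"),
]
```

Starting from `{"rain", "cold"}`, the loop derives `wet_ground` and then `ice`.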
Indeed; I can't find my old comment on the topic, but that's exactly the point: it's not how feasible it is to find a new proof, but how meaningful those proofs are. Are they yet another iteration of the same kind, perfectly fitting the current paradigm and thus bringing very little to the table, or are they radical and thus potentially (but not always) opening up the field?
With brute force, or slightly better than brute force, it's most likely the first, thus not totally pointless but probably not very useful. In fact it might not even be worth the tokens spent.
I'm of the opinion that everything we've discovered is via combinatorial synthesis. Standing on the shoulders of giants and all that. I'm not sure I've seen any convincing argument that we've discovered anything ex nihilo.
I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
That's been achieved already with a few Erdős problems, though those tended to be ambiguously stated in a way that made them less obviously compelling to humans. This problem is obscure; even the linked writeup admits that perhaps ~10 mathematicians worldwide are genuinely familiar with it. But it's not infeasibly hard for a few weeks' or months' work by a human mathematician.
It is not. You're operating under the assumption that all open math problems are difficult and novel.
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
Both constructing hypergraphs (sets) and proving lower bounds are very regular, chore-type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
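As a rough illustration of how routine these objects are to manipulate (a toy example of mine, not the paper's actual problem): a 3-uniform hypergraph is just a set of 3-element vertex subsets, and a Ramsey-style lower-bound argument often amounts to exhibiting a coloring with no monochromatic edge, which is brute-forceable for tiny cases:

```python
from itertools import combinations, product

def has_good_coloring(n, edges, colors=2):
    """Brute force over all vertex colorings; feasible only for tiny n."""
    for coloring in product(range(colors), repeat=n):
        # A "good" coloring leaves no edge with all vertices the same color.
        if not any(len({coloring[v] for v in e}) == 1 for e in edges):
            return True
    return False

# Complete 3-uniform hypergraph on 4 vertices: coloring two vertices in
# each color leaves no monochromatic triple.
edges = [frozenset(e) for e in combinations(range(4), 3)]
```

On 5 vertices the same check fails: by pigeonhole, some color class contains 3 vertices, and that triple is an edge of the complete hypergraph.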
LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
This isn't science fiction. But it's nice that the LLMs solved something for once.
Someone has to explain to me exactly what is implied here? Looking at the prompt:
USER:
don't search the internet.
This is a test to see how well you can craft non-trivial, novel and creative solutions given a "combinatorics" math problem. Provide a full solution to the problem.
Why not search the internet? Is this an open problem or not? Can the solution be found online? Then it's an already-solved problem, no?
USER:
Take a look at this paper, which introduces the k_n construction: https://arxiv.org/abs/1908.10914
Note that it's conjectured that we can do even better with the constant here. How far up can you push the constant?
How much does that paper help? It seems like a pretty big hint.
And it sounds like the USER already knows the answer, given the way they prompt the model, so I'm really confused about what we mean by "open problem". I at first assumed a never-before-solved problem, but now I'm not sure.
"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."
I find that very surprising. This problem seemed out of reach 3 months ago, but now all three frontier models are able to solve it.
Is everybody distilling each others models? Companies sell the same data and RL environment to all big labs? Anybody more involved can share some rumors? :P
I do believe that AI can solve hard problems, but the fact that progress is so evenly distributed within a narrow domain makes me a bit suspicious that there is a hidden factor. Like, did some "data worker" solve a problem like that, and it's now in the training data?
Yes there's a whole ecosystem of companies that create and sell RL gyms to AI labs and of course they develop their own internally too. You don't hear much about this ecosystem because RL at scale is all private. Nearly no academic research on it.
A lot of this is probably just throwing roughly equal amounts of compute at continuous RLVR training. I'm not convinced there's any big research breakthrough that separates GPT 5.4 from 5.2. The diff is probably more than just checkpoints but less than neural architecture changes and more towards the former than the latter.
I think it's just easy to underestimate how much impact continuous training+scaling can have on the underlying capabilities.
Is it possible the AI labs are seeding their models with these solved problems? Like, if I was Sam Altman with a bazillion dollars of investment I would pay some mathematicians to solve some of these problems so that the models could "solve" them later on. Not that I think it's what's happening here of course...
But it is pretty funny how 5.4 miscounted the number of 1's in 18475838184729 on the same day it solved this.
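For what it's worth, the count in question is a one-liner for conventional code, which is what makes the jaggedness funny:

```python
# Counting decimal digits is trivial for ordinary code,
# even when it trips up a frontier model.
n = 18475838184729
ones = str(n).count("1")  # -> 2
print(ones)
```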
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that "scaffold"? A sort of unit-test framework for proofs?
I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
I was trying to get Claude and Codex to write a proof of the Collatz conjecture in Isabelle, but annoyingly they didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
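For readers unfamiliar with such systems, the value is that statements are machine-checkable, so an AI's output can be verified automatically. A trivial Lean 4 example (illustrative only, and obviously unrelated to Collatz):

```lean
-- The prover mechanically certifies this statement; that automatic
-- verification is exactly what an AI feedback loop would exploit.
theorem add_zero_example (n : Nat) : n + 0 = n :=
  Nat.add_zero n
```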
I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
Certainly, knowing how many/which people are working on a problem you are looking at, and how long it will take you to solve it, are critical skills for a working researcher. What kind of answer are you looking for? It's hard to quantify. Most people suck at this type of assessment as PhD students and then get better as time goes on.
I feel like, reading some of these comments, some people need to go and read the history of ideas and philosophy (which is easier today than ever before, with the help of LLMs!)
It's like I'm reading 17th-18th century debates rehashing the same arguments between rationalists and empiricists, lol. Maybe we're due for a 21st-century Kant.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that 1) AI solving frontier math problems and 2) AI posing interesting/relevant math problems, together, would be an "oh shit" moment. Because that would be true PhD-level research.
Considering that an LLM simply remixes what it finds in its learned distribution over text, it's possible that it can pose new math problems by identifying gaps ("obvious" in retrospect) that humans may have missed (like connecting two known problems to pose a new one). What LLMs can't currently do is pose new problems by observing the real world and its ramifications, like the moving sofa problem.
> This problem is about improving lower bounds on the values of a sequence that arises in the study of simultaneous convergence of sets of infinite series, defined as follows.
One thing I notice in the AlphaEvolve paper, as well as here, is that these LLMs have been shown to solve optimization problems, something we have been using computers for, for a really long time. In fact, I think the AlphaEvolve-style prompt-augmentation approach is a more principled version of what these guys have done here, and I'm fairly confident this problem would have been solved with that approach as well.
In spirit, the LLM seems to compute the {meta-,}optimization step(s) in activation space, or it is merely retrieving candidate proposals.
It would be interesting to see if we can extract or model the exact algorithms from the activations, or show that the model is simply retrieving and proposing deductive closures of said computation.
In the latter case, it would mean that LLMs alone can never "reason" and you need an external planner+verifier (an AlphaEvolve-style evolutionary planner, for example).
We are still looking for proof of the former behaviour.
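For reference, such an external evolutionary planner+verifier loop can be caricatured in a few lines. All names and the scoring setup below are my own illustrative assumptions, not AlphaEvolve's real interface:

```python
# Toy evolutionary search: keep a population of candidates, rank them with
# an external verifier/score, keep the best, and mutate them. In an
# AlphaEvolve-style system, the `propose` step would be an LLM call.

def evolve(score, propose, seed, generations=10, population=8):
    pool = [seed]
    for _ in range(generations):
        pool.sort(key=score, reverse=True)   # external verifier ranks candidates
        pool = pool[:population]             # selection
        pool += [propose(p) for p in pool]   # mutation / LLM proposal step
    return max(pool, key=score)
```

For instance, maximizing `-(x - 7)**2` from a seed of 0 with `propose = lambda x: x + 1` climbs to 7; a real system would use stochastic, model-generated edits instead of a fixed increment.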
Software developers have spent decades at this point discounting and ignoring almost all objective metrics for software quality and the industry as a whole has developed a general disregard for any metric that isn't time-to-ship (and even there they will ignore faster alternatives in favor of hyped choices).
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about: time-to-ship. The results are just a sped-up version of what we've had now for more than two decades: software is getting slower, buggier, and less usable.
If you don't have a good regulating function for what represents real quality you can't really expect systems that just pump out code to actually iterate very well on anything. There are very few forcing functions to use to produce high quality results though iteration.
But we don't even seem to be getting faster time-to-ship in any way that anybody can actually measure; it's always some vague sense of "we're so much more productive".
This doesn't pass a sniff test. We have plenty of ways to verify good software, else you wouldn't be making this post. You know what bad software is and looks like. We want something fast that doesn't throw an error every 3 page navigations.
You can ask an LLM to make code in whatever language you want. And it can be pretty good at writing efficient code, too. Nothing about NPM bloat is keeping you from making a lean website. And AI could theoretically be great at testing all parts of a website, benchmarking speeds, trying different viewports etc.
But unfortunately we are still on the LLM train. It just doesn't have anything built-in to do what we do, which is use an app and intuitively understand "oh this is shit." And even if you could allow your LLM to click through the site, it would be shit at matching visual problems to actual code. You can forget about LLMs for true frontend work for a few years.
And they are just increasingly worse with more context, so any non-trivial application is going to lead to a lot of strange broken artifacts, because text prediction isn't great when you have numerous hidden rules in your application.
So as much as I like a good laugh at failing software, I don't think you can blame shippers for this one. LLMs are not struggling in software development because they are averaging a lot of crap code, it's because we have not gotten them past unit tests and verifying output in the terminal yet.
They haven't, not at all as far as I can tell. This math problem appears to be a nice chore to be solved, the equivalent to "Claude, optimize this code" or "Write a parser", which is being done 100000x a day.
The original researchers who proposed this problem tried and failed multiple times to solve it. Does that sound like a 'nice chore to be solved' to you ?
There seems to be a focus on understanding when talking about LLMs and solving problems. Personally, I do not think understanding is required. I can write a very small program that can calculate Pi to however many digits I like, or calculate any digit in the sequence on demand, without the program or computer having any understanding at all of what Pi is or what it means. I could get Claude to output that same code when prompted to find a solution to generating Pi, also with no understanding of what Pi is, or what it means.
IMO the ability to provide an accurate solution to a problem is not always based on understanding the problem.
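To make that concrete, here is such a program: a sketch of Gibbons' well-known unbounded spigot algorithm. Nothing in it "understands" what Pi is, yet it emits correct digits on demand, using only integer arithmetic:

```python
def pi_digits(n):
    """Return the first n decimal digits of pi via Gibbons'
    unbounded spigot algorithm (plain integer arithmetic only)."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    out = []
    while len(out) < n:
        if 4 * q + r - t < m * t:
            # The next digit is settled; emit it and rescale the state.
            out.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            # Consume one more term of the underlying series.
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return out

print(pi_digits(10))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```

The program is just a handful of integer updates; whatever "understanding" of Pi exists here lived in the head of the person who derived the recurrence, not in the thing executing it.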
Not sure if AI can have clever or new ideas; it still seems to me that it combines existing knowledge and executes algorithms.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.
Seems like the high-compute parallel-thinking models weren't even needed; both the normal GPT-5.4 and Gemini 3.1 Pro solved it. Somehow Gemini 3 Deep Think couldn't solve it.
Is their scaffold available? Does it do anything special beyond feeding the warmup, single challenge, and full problem to an LLM? Because it's interesting that GPT-5.2 Pro, arguably the best model until a few months ago, couldn't even solve the warmup. And now every frontier model can solve the full problem. Even the non-Pro GPT-5.4. Also strange that Gemini 3 Deep Think couldn't solve it, whereas Gemini 3.1 Pro could. I read that Deep Think is based on 3.1 Pro. Is that correct?
I see that GPT-5.2 Pro and Gemini 3 Deep Think simply had the problems entered into the prompt. Whereas the rest of the models had a decent amount of context, tips, and ideas prefaced to the problem. Were the newer models not able to solve this problem without that help?
Anyway, impressive result regardless of whether previous models could've also solved it and whether the extra context was necessary.
I know these frontier models behave differently from each other. I wonder how many problems they could solve combining efforts.
I don't understand the position that learning through inference/example is somehow inferior to a top down/rules based learning.
Humans learn many, and perhaps even the majority, of things through observed examples and inference of the "rules". Not from primers and top down explanation.
E.g. Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Or: Observing a game being played to form an understanding of the rules, rather than reading the rulebook
Further: the majority of "novel" insights are simply the combination of existing ideas.
Look at any new invention, music, art etc and you can almost always reasonably explain how the creator reached that endpoint. Even if it is a particularly novel combination of existing concepts.
What are the odds that this is because OpenAI is pouring more money into high-publicity stunts like this, rather than its model actually being better than Anthropic's?
Reading this thread I'm reassured that despite everything AI may disrupt, humans arguing past each other about philosophy of knowledge and epistemology on internet forums is safe :')
Domain-experienced users are effectively training LLMs to mimic themselves in solving their problems, thereby solving their problems via chat-data concentration.
Beside the point of the supposed achievement (which is supposedly confirmed): my point is that Epoch.ai is possibly just a PR firm for *Western* AI providers, in which case this news is possibly untrustworthy.
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
This is a lot like the 50 million monkeys on 50 million typewriters who will eventually write Shakespeare... We have all heard this; pity the poor proofreaders who will have to proof it all in search of the holy grail: zero errors.
In a similar way, LLMs are permutational cross-associating engines, matched with sieves to filter out the dross. Less filtering = more dross, AKA slop.
It can certainly create enormous masses of bad code, and with well-filtered screens for dross we can see it can create passable code. However, stray flaws (flies) can creep in and not get filtered, and humans are better at seeing flies in their oatmeal.
AI seems very good at permutational code assaults on masses of code to find the flies (zero-days), so I expect it to make code more secure, as few humans have the ability/time to mount that sort of permutational assault on code bases. I see this idea has already taken root among code writers as well as hackers/China etc.
These two opposing forces will assault code bases, one to break and one to fortify. In time there will be fewer places where code bases have hidden flaws as soon all new code will be screened by AI to find breaks so that little or no code will contain these bugs.
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.
> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?
I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:
1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.
2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.
However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:
1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the Math Olympiad.
2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
I studied philosophy focusing on the analytic school and proto-computer science. LLMs are going to force many people to start getting a better understanding of what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.
Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there are some areas to catch up in previous experimental outcomes.
Having a focus in philosophy of language is something I genuinely never thought would be useful. It's really been helpful with LLMs, but probably not in the way most people think. I'd say that curious folks should all be reading Quine, Wittgenstein's Investigations, and probably Austin.
There are ways to go beyond the human-quality-data limitation. AI can be trained on data of better than average human quality, because many problems have easily verifiable solutions. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at it.
It's also possible that there can be emergent capabilities. Perhaps a little obtuse, but you can say that humans are trained on human-quality data too and yet brilliant scientists and creative minds can rise above the rest of us.
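To sketch the grader idea (everything here is illustrative, not a real RL framework): the reward for a candidate program can simply be the fraction of input/output test cases it passes, which is cheap to compute and impossible to argue with:

```python
import subprocess
import sys
import tempfile
import textwrap

def grade(solution_code, tests, timeout=5):
    """Hypothetical automatic grader for an RL reward signal:
    reward = fraction of (stdin, expected stdout) pairs the
    candidate program gets right."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(solution_code))
        path = f.name
    passed = 0
    for stdin, expected in tests:
        try:
            result = subprocess.run([sys.executable, path], input=stdin,
                                    capture_output=True, text=True,
                                    timeout=timeout)
            passed += result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            pass  # infinite loops simply score zero on that case
    return passed / len(tests)

# A toy "competitive programming" task: double the input number.
reward = grade("print(int(input()) * 2)", [("3", "6"), ("21", "42")])
print(reward)  # 1.0
```

Because the grader never consults human judgment, nothing caps the policy at human performance; it just has to pass more tests than the last iteration did.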
> Their capabilities should saturate at human or maybe above-average human performance
LLMs do have superhuman reasoning speed and superhuman dedication. Speed is something you can scale, and at some point quantity can turn into quality. Much of the frontier work done by humans is just dedication, luck, and remixing other people's ideas ("standing on the shoulders of giants"), isn't it? All of this is exactly what you can scale by having restless hordes of fast-thinking agents, even if each of those agents is intellectually "just above average human".
> 1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.
Why oh why is this such a commonly held belief. RL in verifiable domains being the way around this is the entire point. It’s the same idea behind a system like AlphaGo — human data is used only to get to a starting point for RL. RL will then take you to superhuman performance. I’m so confused why people miss this. The burden of proof is on people who claim that we will hit some sort of performance wall because I know of absolutely zero mechanisms for this to happen in verifiable domains.
The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.
LLMs can generate anything by design. LLMs can't understand what they are generating, so it may be true, it may be wrong, it may be novel, or it may be a known thing. The model doesn't discern between them; it just looks for the best statistical fit.
The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving unsolved math problems" a certain meaning in our heads. Some of us, at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug etc. Something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables in a drawer, or a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. These are formally novel things, but we never really needed any of them and so relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for solutions to many such problems, but they are not that impressive in the first place.
> It doesn't discern between them, just looks for the best statistical fit
Of course at the lowest level, LLMs are trained on next-token prediction, and on the surface, that looks like a statistics problem. But this is an incredibly reductionist viewpoint and I don't see how it makes any empirically testable predictions about their limits. LLMs 'learned' a lot of math and science in this way.
> "truly novel" and "solving unsolved math problem"
OK again if novelty lies on a continuum, where do you draw the line? And why is it correct to draw it there and not somewhere else? It seems like you are just naming exceptionally hard research problems.
If LLMs can come up with truly novel solutions to things, and you have a verification loop to ensure that they are actual proper solutions, I don't understand why you think they could never come up with solutions to impressive problems, especially considering the thread we are literally on right now. At this point it seems like a pure assertion that they will always be limited to coming up with truly novel solutions only to uninteresting problems.
> It doesn't discern between them, just looks for the best statistical fit.
Why is this not true for humans?
> Which a formally novel things, but we really never needed any of that
The history of science and maths is littered with seemingly useless discoveries being pivotal as people realised how they could be applied.
It's impossible to tell what we really "need"
> LLMs can't understand what they are generating
You don't understand what "understanding" means. I'm sure you can't explain it. You are probably just hallucinating the feeling of understanding it.
> Some of us at least, think that truly novel means something truly novel and important, something significant. Like, I don't know...
Yeah.
LLMs are notoriously terrible at multiplying large numbers: https://claude.ai/share/538f7dca-1c4e-4b51-b887-8eaaf7e6c7d3
> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631
Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939
> 2 165 842 392 930 662 831
Your example seems short enough to not pose a problem.
Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a Python script can, so that's what any tool-using LLM will (and should) do.
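And the tool call is trivially exact; Python integers are arbitrary-precision, so "use a calculator" reduces to a single expression, which confirms the WolframAlpha answer above:

```python
# Exact big-integer arithmetic is native in Python; no precision
# limit, no approximation, no "statistical fit" involved.
product = 729_278_429 * 2_969_842_939
print(product)  # 2165842392930662831
```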
This hasn't been true for a while now.
I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.
Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.
This doesn’t address the author’s point about novelty at all. You don’t need 100% accuracy to have the capability to solve novel problems.
I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.
I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).
I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.
It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.
This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.
[1]: https://imgur.com/a/gWTGGYa
That's actually pretty cool. What made you think of doing this in the first place?
Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot stuff novel for me would be something someone had been doing for decades somewhere.
Why is ScreenCaptureKit a bad choice for performance?
What was the solution?
>>AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
I always found this argument very weak. There isn't that much truly new anyway. Creativity is often about mixing old ideas. Computers can do that faster than humans if they have a good framework. Especially with something as simple as math - limited set of formal rules and easy to verify results - I find a belief computers won't beat humans at it to be very naive.
> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.
Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is because it saw many multiplication problems in its training data, and it was cheaper to learn the compressed/abstract multiplication strategies and encode them as circuits in the network, rather than memorize the times tables up to some large N. This gives it the ability to approximate multiplication problems it hasn't seen before.
> This gives it the ability to approximate multiplication problems it hasn't seen before.
More than approximate. It straight up knows the algorithms and will do arbitrarily long multiplications correctly. (Within reason. Obviously it couldn't do a multiplication so large the reasoning tokens would exceed its context window.)
Having ChatGPT 5.4 do 1566168165163321561 * 115616131811365737 without tools, after multiplying out a lot of coefficients, it eventually answered 181074305022287409585376614708755457, which is correct.
At this point, it's less misleading to say it knows the algorithm.
Yup, I agree with this. So based on this, where do you draw the line between what will be possible and what will not be possible?
Why are we reducing AIs to LLMs?
Claude, OpenAI, etc.'s AIs are not just LLMs. If you ask it to multiply something, it's going to call a math library. Go feed it a thousand arithmetic problems and it'll get them 100% right.
The major AIs are a lot more than just LLMs. They have access to all sorts of systems they can call on. They can write code and execute it to get answers. Etc.
Which is exactly how humans learn many things too.
E.g. observing a game being played to form an understanding of the rules, rather than reading the rulebook
Or: Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Most inventions are an interpolation of three existing ideas. These systems are very good at that.
My take as well. Furthermore, most innovations come relatively shortly after their technological prerequisites have been met, so that suggests the "novelty space" that humans generally explore is a relatively narrow band around the current frontier. Just as humans can search through this space, so too should machines be capable of it. It's not an infinitely unbounded search which humans are guided through by some manner of mystic soul or other supernatural forces.
Indeed. Every time someone complains that LLMs can't come up with anything new, I'm assaulted with the depressing remembrance that neither do I.
I can't even find a good example of an invention that is not an interpolation.
The hardest part about any creativity is hiding your influences
This is poetry.
Beliefs are not rooted in facts. Beliefs are a part of you, and people aren't all that happy to say "this LLM is better than me"
I'm very happy to say calculators are far better than me at calculations (to a given precision). I'm happy to admit computers are so much better than me in so many aspects. And I have no problem saying LLMs are very helpful tools able to generate output so much better than mine in almost every field of knowledge.
Yet, whenever I ask it to do something novel or creative, it falls very short. But humans are ingenious beasts and I'm sure sooner or later they will design an architecture able to be creative - I just doubt it will be Transformer-based, given the results so far.
It's not possible to know something without believing it to be true. https://en.wikipedia.org/wiki/Belief#/media/File:Classical_d...
I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.
It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.
A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:
1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).
2. Provide evidence that supports your theory and ideally can not be just as well accounted for another theory.
I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.
I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can't reduce to statistical token generation, it just strikes me as unintuitive and so I'm going to need to hear something to compel me.
[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...
> I think "novel" is ill defined here
That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.
As for evidence that they can do novel things, there is plenty:
1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.
2. SVGs of pelicans riding bicycles
3. People use LLMs to write new apps/code every day
4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available
5. LLMs have solved open problems in physics and mathematics [0,1]
That is as far as they have advanced so far. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).
[0] https://news.ycombinator.com/item?id=47497757
The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?
Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
> e.g. 167,383 * 426,397 = 71,371,609,051
They may be wrong, but so are you.
No, it's correct:
https://www.google.com/search?q=167383+*+426397
You could have just checked the math yourself, you know.
Ximm's Law applies ITT: every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Especially the lemmas:
- any statement about AI which uses the word "never" to preclude some feature from future realization is false.
- contemporary implementations have almost always already been improved upon, but are unevenly distributed.
Anti-Ximm's Law: every response to a critique of AI assumes as much arbitrary level of future improvement as is necessary to make the case.
It is like not trusting someone who attained the highest score on some exam by memorizing the whole textbook to do the corresponding job.
Not very hard to understand.
Yet we do that all the time by hiring based on GPA/degree.
> asserting that LLMs will never generate 'truly novel' ideas or problem solutions
I don't think I've had one of these my entire life. Truly novel ideas are exceptionally rare:
- Darwin's On the Origin of Species
- Gödel's incompleteness theorems
- Buddhist detachment
Can't think of many.
People rarely create things that are wholly new.
Most created things are remixes of existing things.
Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.
When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well-defined problem and then just loop through it.
I think you're arguing with semantics.
Do we know for a fact that LLMs aren't now configured to pass simple arithmetic like this to a simple calculator, to add the illusion of actual insight?
The major AIs have access to all sorts of tools, including a math library. I thought this was well-known. There's no "illusion of actual insight" - they're just "using a calculator" (in the sense that they call a math library when needed). AIs are not just LLMs.
You can train a LLM on just multiplication and test it on ones it has never seen before, it's nothing particularly magical.
> You need to say why it can do some novel tasks but could never do others.
This is actually quite a tall order. Reasoning about AI and making sense of what the LLMs are doing, and learning to think about it as technology, is a very difficult and very tricky problem.
You get into all kinds of weird things about a person’s outlook on life: personal philosophy, understanding of ontology and cosmology, and then whatever other headcanon they happen to be carrying around about how they think life works.
I know that might sound kind of poetic, but I really believe it’s true.
I am a great fan of Dr Richard Hamming and he gave a wonderful series of lectures on the topic. The book Learning to Learn has the full set of his lectures transcribed (highly recommend this book!).
But don't take my word for it, listen to Dr Hamming say it himself: https://www.youtube.com/watch?v=aq_PLEQ9YzI
"The biggest problem is your ego. The second biggest problem is your religion."
Yes! I call these the "it's just a stochastic parrot" crowd.
Ironically, they are the stochastic parrots, because they're confidently repeating something that they read somewhere and haven't examined critically.
That would not be stochastic, just parroting
It's fear.
I guess when it can't be tripped up by simple things like multiplying numbers, counting to 100 sequentially or counting letters in a string without writing a python program, then I might believe it.
Also no matter how many math problems it solves it still gets lost in a codebase
LLMs are bad at arithmetic and counting by design. It's an intentional tradeoff that makes them better at language and reasoning tasks.
If anybody really wanted a model that could multiply and count letters in words, they could just train one with a tokenizer and training data suited to those tasks. And the model would then be able to count letters, but it would be bad at things like translation and programming - the stuff people actually use LLMs for. So people train with a tokenizer and training data suited to those tasks; hence LLMs are good at language and bad at arithmetic.
Arguments like "but AI cannot reliably multiply numbers" fundamentally misunderstand how AI works. AI cannot do basic math not because AI is stupid, but because basic math is an inherently difficult task for otherwise smart AI. Lots of human adults can do complex abstract thinking but when you ask them to count it's "one... two... three... five... wait I got lost".
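The tokenizer point is easy to demonstrate: the model sees subword IDs, not characters, so "count the r's" asks about a representation it never directly observes. A toy illustration (the split and IDs shown are hypothetical, not any real vocabulary):

```python
# A BPE-style tokenizer maps text to subword IDs; character identity
# is buried inside the vocabulary entries. Hypothetical split:
vocab = {101: "straw", 102: "berry"}
token_ids = [101, 102]  # what the model "sees" for "strawberry"

# Counting letters requires reassembling the characters, a step the
# model's input representation never performs on its behalf:
word = "".join(vocab[t] for t in token_ids)
print(word, word.count("r"))  # strawberry 3
```

From two opaque IDs there is no local signal that one token contains one "r" and the other contains two; the model has to have memorized or inferred the spelling of each vocabulary entry.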
Ok, I'll bite. Show me an LLM that comes up with a new math operator. Or one that will come up with the theory of relativity if only Newtonian physics is in its training dataset. That it can remix existing ideas in ways that lead to novel insights is expected; however, the current LLMs can't come up with paradigm shifts that require novel insight. Even humans have a rather limited window in which they can come up with novel insights (when they are young, capable of latent thinking, not yet ossified by the existing formalization of science, and their brains are still energetically capable, without the vascular and mitochondrial dysfunction common as we age).
How many humans have been born until now and how many Einsteins have been born? And in how many hundreds of thousands of years?
2 replies →
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
The ability to learn and infer without absorbing millions of books and all text on internet really does make us special. And only at 20 watts!
Last I checked, humans didn't pop into existence doing that. It happened after billions of years of brute-force, trial-and-error evolution. So well done for falling into the exact same trap the OP cautions against. Intelligence from scratch requires a mind-boggling amount of resources, and humans were no different.
10 replies →
We have a tremendous amount of raw information flowing through our brains 24/7 from before we are born, from the external world through all our senses and from within our minds as it attempts to make sense of that information, make predictions, generally reason about our existence, hallucinate alternative realities, etc. etc.
If you were able to somehow capture all that information in full detail as you've had access to by the age of say 25, it would likely dwarf the amount of information in millions of books by several orders of magnitude.
When you are 25 years old and are presented with a strange-looking ball and told to throw it into a strange-looking basket for the first time, you are relying on an unfathomable amount of information turned into knowledge, and countless prior experiments that you've accumulated and exercised to that point relating to the way your body and the world work.
4 replies →
20 watts ignores the startup cost: Tens of millions of calories. Hundreds of thousands of gallons of water. Substantial resources from at least one other human for several years.
Just an interesting thought experiment: if you took all the sensory information that a child experiences through their senses (sight, hearing, smell, touch, taste) between, say, birth and age five, how many books' worth of data would that be? I asked Claude, and their estimate was about 200 million books. Maybe that number is off by an order of magnitude in either direction. ...but then again Claude is only three years old, not five.
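For what it's worth, a back-of-envelope check lands in the same ballpark as that estimate. Every rate below is an assumed round number, not a measurement:

```python
# Rough Fermi check of the "hundreds of millions of books" figure.
seconds = 5 * 365 * 24 * 3600     # five years, in seconds
bytes_per_second = 1_000_000      # assume ~1 MB/s of effective sensory input
bytes_per_book = 1_000_000        # assume ~1 MB of text per book

book_equivalents = seconds * bytes_per_second / bytes_per_book
print(f"{book_equivalents:,.0f} book-equivalents")  # prints: 157,680,000 book-equivalents
```

With these (debatable) assumptions you get roughly 1.6e8 book-equivalents, the same order of magnitude as the 200 million figure.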
To be fair, the knowledge embedded in an LLM is also, at this point, a couple of orders of magnitude (at least) larger than what the average human being can retain. So it's not like all those books and all that text on the internet are used just to bring them up to our level; they go way beyond it.
Now multiply that by 7 billion to distill the one who will solve a frontier math problem.
Most people have absorbed way too few books to be able to infer properly. Hell, most people are confused by TV remotes.
It's only because humans came up with a problem, worked with the AI and verified the result that this achievement means anything at all. An AI "checking its own work" is practically irrelevant when they all seem to go back and forth on whether you need the car at the carwash to wash the car. Undoubtedly people have been passing this set of problems to AIs for months or years and have gotten back either incorrect results or results they didn't understand; either way, a human confirmation is required. AI hasn't presented any novel problems, other than the multitudes of social problems described elsewhere. AI doesn't pursue its own goals and wouldn't know whether they've "actually been achieved".
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference, and so far we have a couple of minor (trivial?) math solutions. I'm sure if someone had bothered funding a few PhDs for a year we could have found this without AI.
>It's only because humans came up with a problem, worked with the ai and verified the result that this achievement means anything at all.
Replace AI with human here and that's... just how collaborative research works lol.
The only things moving faster than AI are the goalposts in conversations like this. Now we're at "sure, AI can solve novel problems, but it can't come up with the problems themselves on its own!"
I'm curious to see what the next goalpost position is.
1 reply →
Funding a few PhDs for a year costs orders of magnitude more than it did to solve this problem in inference costs. Also, this has been an active research topic for some time. Or I guess the people working on it are just not as good as a random bunch of students? It's amazing the lengths people will go to to maintain their worldview, even if it means belittling hardworking people.
I take it you're not a mathematician. This is an achievement, regardless of whether you like LLMs or not, so let's not belittle the people working on these kinds of problems please.
8 replies →
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of time goes into understanding this, we have a young but rigorous field of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
>But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
You can think whatever you want, but an untestable distinction is an imaginary one.
2 replies →
No, but it does mean that you should know we don't understand what intelligence is, and that maybe LLMs are actually intelligent and humans have the appearance of intelligence, for all we know.
11 replies →
Re: "I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique."
Perhaps this might better help you understand why this assumption still holds: https://en.wikipedia.org/wiki/Orchestrated_objective_reducti...
"Controversial theory justifies assumption". Because humans never hallucinate.
It doesn't. I actually completely reject that theory, and it's nice to see that Wikipedia notes that it is "controversial". There are extremely good reasons to reject this theory. For one thing, any quantum effects are going to be quite tiny/trivial because the brain is too large, hot, and wet to see larger effects, so you somehow have to leap from "tiny effects that last for no time at all" to "this matters fundamentally in some massive way".
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably for the far more important reason: it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
It also would do nothing to indicate that human intelligence is unique.
It is not the assumption that humans are unique; it is that statistical models cannot really think outside the box most of the time.
And you know that humans aren't statistical models how?
2 replies →
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Uh, because up until and including now, we are...?
Every living thing on Earth is unique. Every rock is unique in virtually infinite ways from the next otherwise identical rock.
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
23 replies →
I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
Math and coding competition problems are easier to train on because of strict rules and cheap verification. But once you go beyond that to less-defined things such as code quality, where even humans have a hard time putting down concrete axioms, models start to hallucinate more and become less useful.
We are missing the value function that allowed AlphaGo to go from a mid-range player trained on human moves to superhuman by playing itself. As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
> I don't see this getting better.
We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
50 replies →
This is not formally verified math, so there is no real verifiable-feedback aspect here. The best models for formalized math are still specialized ones, although general-purpose models can assist formalization somewhat.
1 reply →
Maybe to get a real breakthrough we have to make programming languages and tools better suited to LLM strengths, rather than fussing so much about making them write code we like. What we need is correct code, not nice-looking code.
7 replies →
> But once you go beyond that to less defined things such as code quality
I think they have a good optimization target with SWE-Bench-CI.
You are tested on continuous changes to a repository, spanning multiple years in the original repository. Cumulative edits need to be kept maintainable and composable.
If there is something missing from "can be maintained for multiple years, incorporating bugfixes and feature additions" as a definition of code quality, then more work is needed, but I think it's a good starting point.
Do we need all that if we can apply AI to solve practical problems today?
2 replies →
LLMs already do unsupervised learning to get better at creative things. This is possible since LLMs can judge the quality of what is being produced.
LLMs can often guess the final answer, but the intermediate proof steps are always total bunk.
When doing math you only ever care about the proof, not the answer itself.
7 replies →
Except it's not how this specific instance works. In this case the problem isn't written in a formal language and the AI's solution is not something one can automatically verify.
I mean, even if the technology stopped improving immediately and forever (which is unlikely), LLMs are already better than most humans at most tasks.
Including code quality. Not because they are exceptionally good (you are right that they aren’t superhuman like AlphaGo) but because most humans are not that good at it anyway, and also somehow « hallucinate » because of tiredness.
Even today’s models are far from being exploited at their full potential because we actually developed pretty much no tools around it except tooling to generate code.
I’m also a long-time « doubter », but as a curious person I have used the tool anyway, with all its flaws, over the last three years. And I’m forced to admit that hallucinations are pretty rare nowadays. Errors still happen, but they are very rare and it’s easier than ever to get it back on track.
I think I’m also a « believer » now and, believe me, I really don’t want to be, because as much as I’m excited by this, I’m also pretty frightened of all the bad things this tech could do to the world in the wrong hands, and I don’t feel like it’s particularly in the right hands.
I mean, this is why everyone is making bank selling RL environments in different domains to frontier labs.
>it really is a new and exciting world...
The point is that from now on, there will be nothing really new, nothing really original, nothing really exciting. Just an endless stream of rehashed old stuff that is merely okayish.
Like an AI Spotify playlist, it will keep you in chains (aka engaged) without actually making you really happy. It would be like living in a virtual world, but without anything nice about living in such a world.
We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other.
> there will be nothing really new
How is this the conclusion? Isn't this post about AI solving something new? What am I missing?
21 replies →
I heard this saying recently “The problem with comfort is that it makes you comfortable.”
AI can both explore new things and exploit existing things. Nothing forces it to only rehash old stuff.
>without actually making you like really happy or good.
What are you basing this on? I've shared several AI songs with people in real life because of how much I've enjoyed them. I don't see why an AI playlist couldn't be good or make people happy. It just needs to find what you like in music. Again, it comes back to explore vs. exploit.
9 replies →
On what do you base your prediction?
Is it because the AI is trained with existing data? But, we are also trained with existing data. Do you think that there's something that makes human brain special (other than the hundreds of thousands years of evolution but that's what AI is all trying to emulate)?
This may sound hostile (sorry for my lower than average writing skills), but trust me, I'm really trying to understand.
>We have given up everything nice that human beings used to make and give to each other and to make it worse, we have also multiplied everything bad, that human being used to give each other..
Source?
AI is a remixer; it remixes all known ideas together. It won't come up with new ideas though; the LLMs just predict the most likely next token based on the context. That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
But human researchers are also remixers. Copying something I commented below:
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
23 replies →
> AI is a remixer; it remixes all known ideas together.
I've heard this tired old take before. It's the same type of simplistic opinion as "AI can't write a symphony". It is a logical fallacy that relies on moving goalposts to impossible positions, to the point that people lose perspective of what even an average, or even an extremely talented, individual can do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving; for most it would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you have lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what scientists do. So why do people win a prize when they do it, but when a computer does the same, its role is downplayed to that of a glorified card shuffler?
I don't think this is a correct explanation of how things work these days. RL has really changed things.
24 replies →
Turning a hard problem into a series of problems we know how to solve is a huge part of problem solving and absolutely does result in novel research findings all the time.
Standard problem*5 + standard solutions + standard techniques for decomposing hard problems = new hard problem solved
There is so much left in the world that hasn't had anyone apply this approach, purely because no research programme has decided it's worth their attention.
If you want to shift the bar for “original” beyond problems that can be abstracted into other problems then you’re expecting AI to do more than human researchers do.
I entered the prompt:
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It outputted a group of characters that I can virtually guarantee you it has never seen before on its own
27 replies →
Here’s a simple prompt you can try to prove that this is false:
This is a fresh UUIDv4 I just generated; it has not been seen before. And yet the model will output it.
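The arithmetic behind that test is worth spelling out. A version-4 UUID carries 122 random bits, so a fresh one is, with overwhelming probability, a character string no training corpus contains:

```python
import uuid

# The space of possible UUIDv4 values is 2**122 (~5.3e36), so a freshly
# generated one has essentially no chance of appearing in any prior text.
u = uuid.uuid4()
print(u)
print(f"{2**122:.2e} possible UUIDv4 values")  # prints: 5.32e+36 possible UUIDv4 values
```

If the model can echo that string back, it has emitted a "group of characters" it has demonstrably never seen.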
17 replies →
Remixing ideas that already exist is a major part of where innovation and breakthroughs come from. If you look at Bitcoin as an example, hashes (and hashcash) and digital signatures existed for decades before Bitcoin was invented. The cypherpunks also spent decades trying to create a decentralized digital currency, to the point where many of them gave up and moved on. Eventually one person just took all of the pieces that already existed and put them together in the correct way. I don't see any reason why a sufficiently capable LLM couldn't do this kind of innovation.
Yeah but you're thinking of AI as like a person in a lab doing creative stuff. It is used by scientists/researchers as a tool *because* it is a good remixer.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
1 reply →
No. That's wrong. LLMs don't output the highest-probability token: they do random sampling.
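A minimal sketch of what that sampling looks like: the model emits logits, softmax (with a temperature) turns them into a distribution, and the next token is drawn at random rather than taken greedily. The logits below are made up for illustration:

```python
import math
import random

# Made-up logits for three candidate next tokens; a real model emits
# one logit per vocabulary entry.
logits = {"cat": 2.0, "dog": 1.5, "axolotl": 0.1}
temperature = 0.8

# Softmax with temperature turns logits into a probability distribution.
exps = {tok: math.exp(l / temperature) for tok, l in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}

# The next token is drawn at random, weighted by probability, so even
# the least likely candidate is occasionally emitted.
token = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(token, {t: round(p, 3) for t, p in probs.items()})
```

Lower temperatures sharpen the distribution toward the top token; higher ones flatten it, which is why sampled output is not simply "the most common continuation".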
3 replies →
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
This is false.
The ability for some people to perpetually move the goalpost will never cease to amaze me.
I guess that's one way to tell us apart from AIs.
1 reply →
We need a website with refutations that one can easily link to. This interpretation of LLMs is outdated and unproductive.
Yes, ChatGPT and friends are essentially the same thing as the predictive text keyboard on your phone, but scaled up and trained on more data.
12 replies →
Obligatory Everything is a Remix: https://www.youtube.com/watch?v=nJPERZDfyWc
Move 37.
I mean, it's not going to invent new words, no, but it can figure out new sentences or paragraphs, even ones it hasn't seen before, if they're highly likely given its training and context. Those new sentences and paragraphs may describe new ideas, though!
1 reply →
[dead]
I'm curious as to why you consider this as the benchmark for AI capabilities. Extremely few humans can solve hard problems or do much innovation. The vast majority of knowledge work requires neither of these, and AI has been excelling at that kind of work for a while now.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
Fair point, however I am simply more interested in how AI can advance frontiers than in how it can transcribe a meeting and give a summary or even print out React code. I know the world is heavily in need of the menial labor and AI already has made that stuff way easier and cheaper.
However I'm just very interested in innovation and pushing the boundaries as a more powerful force for change. One project I've been super interested in for a while is the Mill CPU architecture. While they haven't (yet) made a real chip to buy, the ideas they have are just super awesome and innovative in a lot of areas involving instruction density & decoding, pipelining, and trying to make CPU cores take 10% of the power. I hope the Mill project comes to fruition, and I hope other people build on it, and I hope that at some point AI could be a tool that prints out innovative ideas that took the Mill folks years to come up with.
1 reply →
most issues at every scale of community and time are political, how do you imagine AI will make that better, not worse?
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life-saving treatment for a child; how much your country should curb fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and I've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
https://www.darioamodei.com/essay/the-adolescence-of-technol...
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
I agree with what you're saying, and I certainly don't think the one problem facing my country or the world is just that we didn't solve the right math problem yet. I am saddened by the direction the world keeps moving.
When I wrote that I hope we use it for good things, I was just putting a hopeful thought out there, not necessarily trying to make realistic predictions. It's more than likely people will do bad things with AI. But it's actually not set in stone yet, it's not guaranteed that it has to go one way. I'm hopeful it works out.
It 100% will not be used to make the world better and we all know it will be weaponised first to kill humans like all preceding tech
Most tech gets used for good and bad.
Are the only two options AI doubter and AI believer?
Perhaps I should have elaborated more but what I mean is I used to think, "I genuinely don't see the point in even trying to use AI for things I'm trying to solve". Ironically though, I think that because I've repeatedly tried and tested AI and it falls flat on its face over and over. However, this article makes me more hopeful that AI actually could be getting smarter.
All I hear about are AI believers and AI-doubters-just-turned-believers
1 reply →
Asking the right questions...
I remember a conversation between two super-duper VCs (I don't remember who, but famous ones) about how DeepSeek was a super-genius-level model because it solved an intro-level (like week 1-2) electrodynamics problem stated in a very convoluted way.
While cool and impressive for an LLM, I think they oversold the feat by quite a bit.
I don't want to belittle the performance of this model, but I would like someone with domain expertise (and no dog in the AI race, like a random math PhD) to come forward and explain exactly what the problem was, and how the model contributed to the solution.
> I really hope we use this intelligence resource to make the world better.
I wished I had your optimism. I'm not an AI doubter (I can see it works all by myself so I don't think I need such verification). But I do doubt humanity's ability to use these tools for good. The potential for power and wealth concentration is off the scale compared to most of our other inventions so far.
> I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world.
We already have a few years of experience with this.
> I really hope we use this intelligence resource to make the world better.
We already have a few years of experience with this.
The problem is that the AI industry has been caught lying about their accomplishments and cheating on tests so much that I can't actually trust them when they say they achieved a result. They have burned all credibility in their pursuit of hype.
I'm all for skeptical inquiry, but "burning all credibility" is an overreaction. We are definitely seeing very unexpected levels of performance in frontier models.
> born-again AI believer
sigh
I honestly do think I'm being honest with myself. I have held it in my mind that I'm not impressed until it's innovative. That threshold seems to be getting crossed.
I'm not saying, "I used to be an atheist, but then I realized that doesn't explain anything! So glad I'm not as dumb now!"
1 reply →
It's less solving a problem than trying every single solution until one works. Exhaustive search, pretty much.
It's pretty much how all the hard problems are solved by AI from my experience.
If LLMs really solved hard problems by "trying every single solution until one works", we'd be waiting until kingdom come for any significant result at all. Instead, this is just one of a few results that have cropped up in recent months, and likely a foretaste of many to come.
In other words, it's solving a problem.
14 replies →
The link has an entire section on "The infeasibility of finding it by brute force."
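A quick sense of scale backs that section up. For Ramsey-type questions on 3-uniform hypergraphs, a 2-coloring assigns one of two colors to every triple of vertices, so the search space grows as 2**C(n, 3) (illustrative arithmetic only; the actual problem's parameters are in the linked write-up):

```python
from math import comb

# Number of 2-colorings of the triples of an n-element vertex set:
# one of two colors per triple, hence 2**C(n, 3) total colorings.
for n in (10, 20, 30):
    triples = comb(n, 3)
    print(f"n={n}: {triples} triples, 2**{triples} colorings")
```

Already at n=20 there are 2**1140 colorings, astronomically beyond any conceivable enumeration, which is why "it just brute-forced it" is not a live explanation.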
No, that's precisely solving a problem.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
But this is exactly how we do math.
We start writing all those formulas etc., and if at some point we realise we went the wrong way, we start from the beginning (or from some point we are sure about).
How do you think mathematicians solve problems?
That's also the only way how humans solve hard problems.
6 replies →
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
I wonder what was in that solutions file they provided. According to the prompt it’s a solution template but I want to know the contents.
Another thing I want to know is how the user keeps updating the LLM with the token usage. I didn’t know they could process additional context midtask like that.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
If only 5-10 people have ever tried to solve something in programming, every LLM will start regurgitating your own decade-old attempt again and again, sometimes even with the exact comments you wrote back then (good to know it trained on my GitHub repos...), but you can spend upwards of 100 million tokens in gemini-cli or Claude Code and still not make any progress.
It's, after all, still a remix machine; it can only interpolate between things that already exist. Which is good for a lot of tasks, considering everything is a remix, but it can't do truly new ones.
2 replies →
That's why context management is so important. AI not only gets more expensive if you waste tokens like that, it may perform worse too.
Even as context sizes get larger, this will likely remain relevant, especially since AI providers may jack up the price per token at any time.
You're glossing over the fact that mathematics uses only one token per variable (`x = ...`), whereas software engineering best practices demand an excessive number of tokens per variable for clarity.
It's also a pretty silly thing to say difficulty = tokens. We all know line counts don't tell you much, and it shows in their own example.
Even if you did have math-like tokenisation, refactoring a thousand lines of "X=..." to "Y=..." isn't a difficult problem, even though it would be at least a thousand tokens. And the fact that you could come up with E=mc^2 in a thousand tokens does not make the two tasks remotely comparable in difficulty.
Try the refactor again tomorrow. It might have gotten easier or more difficult.
> I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is (...)
The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.
You have LLMs quite capable of accomplishing complex software engineering work that struggle to translate valid text from English into some other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.
I think it's more of a data vs intelligence thing.
They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).
Certain problems sit in between the two (a React refactor probably lands there). So no, tokens are probably not a good proxy for complexity; data-heavy problems will trivially outgrow the "thinking" category in tokens.
I don't think so. I went through the output of Opus 4.6 vs GPT 5.4 Pro. Both were given different directions/prompts. Opus 4.6 was asked to test and verify many things; it tried many different approaches, and its chains of thought were more interesting to me.
You might be joking, but you're probably also not that far off from reality.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems are opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times, I have done the mental equivalent of solving an open math problem.
5 replies →
>The details about human involvement are always hazy and the significance of the problems are opaque to most.
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
I mean the details are in the post. You can see the conversation history and the mathematician survey on the problem
The capabilities of AI are determined by the cost function it's trained on.
That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes imply something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.
Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).
The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up a RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.
However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or includes genuine Knightian uncertainty where a reward function can not be built without actually operating independently in the world.
To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.
> But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
Can you give some examples?
It is not obvious that anything exists which cannot be written as an optimization problem.
Even generalizations that seemed radical at the time, such as the complex numbers, can in hindsight be said to optimize something, e.g. the number of mathematical symbols you need to carry out certain proofs.
I think you're misreading me. My point isn't that you can't in principle state the optimization problem, but that it's much easier in some domains than in others, that this tracks with how AI has been progressing, and that progress in one area doesn't automatically mean progress in another, because current AI cost functions are less general than the cost functions that humans are working with in the world.
I am thinking there’s a large category of problems that can be solved by resampling existing proofs. It’s the kind of brute force expedition machine can attempt relentlessly where humans would go mad trying. It probably doesn’t really advance the field, but it can turn conjectures into theorems.
I wonder if teaching an LLM how to write Prolog, and then letting it write it, could be a great way to explore spaces like this in the future.
I only ever learned it in school, but if memory serves, Prolog is a whole "given these rules, find the truth" sort of language, which aligns well with these sorts of problem spaces. Mix and match enough, especially across disparate domains, and you might get some really interesting things derived and discovered that are low-hanging fruit just waiting to be discovered.
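That's roughly right. For readers who never touched Prolog, the flavor can be sketched in a few lines of plain Python: a toy forward-chaining loop that derives new facts from a rule until nothing new appears. The facts and rule here are invented purely for illustration.

```python
# Toy forward-chaining "given these rules, find the truths" loop,
# Prolog-flavored but written in plain Python for illustration.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def apply_grandparent_rule(facts):
    # Encodes: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    return {("grandparent", x, z)
            for (r1, x, y) in facts if r1 == "parent"
            for (r2, y2, z) in facts if r2 == "parent" and y2 == y}

# Keep applying the rule until no new facts appear (a fixpoint).
while True:
    new = apply_grandparent_rule(facts) - facts
    if not new:
        break
    facts |= new

print(("grandparent", "alice", "carol") in facts)  # → True
```

Real Prolog adds unification and backtracking search on top of this, which is what makes it attractive for exploring rule spaces.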
Indeed, can't find my old comment on the topic but that's indeed the point, it's not how feasible it is to "find" new proof, but rather how meaningful those proofs are. Are they yet another iteration of the same kind, perfectly fitting the current paradigm and thus bringing very little to the table or are they radical and thus potentially (but not always) opening up the field?
With brute force, or slightly better than brute force, it's most likely the first, thus not totally pointless but probably not very useful. In fact it might not even be worth the tokens spent.
I'm of the opinion that everything we've discovered is via combinatorial synthesis. Standing on the shoulders of giants and all that. I'm not sure I've seen any convincing argument that we've discovered anything ex nihilo.
How about this guy? https://en.wikipedia.org/wiki/Srinivasa_Ramanujan
How do you think you can design a benchmark to solve truly novel problems?
I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)
Super cool, of course.
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
https://epoch.ai/frontiermath/open-problems
I’d hope this isn’t a goalpost move - an open math problem of any sort being solved by a language model is absolute science fiction.
That's been achieved already with a few Erdős problems, though those tended to be ambiguously stated in a way that made them less obviously compelling to humans. This problem is obscure; even the linked writeup admits that perhaps ~10 mathematicians worldwide are genuinely familiar with it. But it's not unfeasibly hard for a few weeks' or months' work by a human mathematician.
1 reply →
It is not. You're operating under the assumption that all open math problems are difficult and novel.
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
Both constructing hypergraphs (sets) and lower bounds are very regular, chore type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
This isn't science fiction. But it's nice that the LLMs solved something for once.
2 replies →
Someone has to explain to me exactly what is implied here? Looking at the prompt:
Why not search the internet? Is this an open problem or not? Can the solution be found online? Then it's an already-solved problem, no?
How much does that paper help? It kind of seems like a pretty big hint.
And it sounds like the USER already knows the answer, the way that it prompts the model, so I'm really confused what we mean by "open problem", I at first assumed a never solved before problem, but now I'm not sure.
"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."
I find that very surprising. This problem seems out of reach 3 months ago but now the 3 frontier models are able to solve it.
Is everybody distilling each others models? Companies sell the same data and RL environment to all big labs? Anybody more involved can share some rumors? :P
I do believe that AI can solve hard problems, but that progress is so distributed in a narrow domain makes me a bit suspicious somehow that there is a hidden factor. Like did some "data worker" solve a problem like that and it's now in the training data?
Yes there's a whole ecosystem of companies that create and sell RL gyms to AI labs and of course they develop their own internally too. You don't hear much about this ecosystem because RL at scale is all private. Nearly no academic research on it.
A lot of this is probably just throwing roughly equal amounts of compute at continuous RLVR training. I'm not convinced there's any big research breakthrough that separates GPT 5.4 from 5.2. The diff is probably more than just new checkpoints but less than neural-architecture changes, and closer to the former than the latter.
I think it's just easy to underestimate how much impact continuous training+scaling can have on the underlying capabilities.
Is it possible the AI labs are seeding their models with these solved problems? Like, if I was Sam Altman with a bazillion dollars of investment I would pay some mathematicians to solve some of these problems so that the models could "solve" them later on. Not that I think it's what's happening here of course...
But it is pretty funny how 5.4 miscounted the number of 1's in 18475838184729 on the same day it solved this.
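For the record, the check it reportedly fumbled is a one-liner in ordinary code, which is exactly the odd capability profile being pointed out:

```python
# Counting a digit's occurrences is trivial for code, but awkward for
# an LLM that sees the number as a handful of multi-digit tokens.
n = "18475838184729"
print(n.count("1"))  # → 2
```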
Maybe so, but GPT 5.4 is absolutely pulling ahead. You can see the differences visually on https://minebench.ai/.
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that "scaffold"? A sort of unit-test framework for proofs?
I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
Usually involves a lot of agents and their custom contexts or system prompts.
I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.
That sounds like the start of a very lucrative career. Are you sure it was Gemini and not an AI competitor offering affiliate commission? ;)
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a-priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
Read about Paul Erdős... not all math is the Riemann Hypothesis; there is yeoman's work connecting things and whatever...
Certainly knowing how many/which people are working on a problem you are looking at, and how long it will take you to solve it, are critical skills in being a working researcher. What kind of answer are you looking for? It's hard to quantify. Most suck at this type of assessment as a PhD student and then you get better as time goes on.
I feel like this single image perfectly sums up the entire thread here: https://trapatsas.eu/sites/llm-predictions/
It's not like this is new to AI
https://oertx.highered.texas.gov/courseware/lesson/1849/over...
Yes, and no matter when "now" is, the doubters will always see in their mind's eye the flat line extending to the right.
That's tautological
Reading some of these comments, I feel like some people need to go and read the history of ideas and philosophy (which is easier today than ever before, with the help of LLMs!).
It's like I'm reading 17th-18th century debates rehashing the same arguments between rationalists and empiricists, lol. Maybe we're due for a 21st-century Kant.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that (1) AI solving frontier math problems and (2) AI posing interesting/relevant math problems, taken together, would be an "oh shit" moment. Because that would be true PhD-level research.
Considering that an LLM simply remixes what it finds in its learned distribution over text, it's possible that it can pose new math problems by identifying gaps ("obvious" in retrospect) that humans may have missed (like connecting two known problems to pose a new one). What LLMs can't currently do is pose new problems by observing the real world and its ramifications, like the moving sofa problem.
Yes. I doubt it can do that.
> This problem is about improving lower bounds on the values of a sequence that arises in the study of simultaneous convergence of sets of infinite series, defined as follows.
One thing I notice in the AlphaEvolve paper, as well as here, is that these LLMs have been shown to solve optimization problems, something we have been using computers for for a very long time. In fact, I think the AlphaEvolve-style prompt-augmentation approach is a more principled version of what these guys have done here, and I am fairly confident this problem would have been solved by that approach as well.
In spirit, the LLM seems to compute the (meta-)optimization steps in activation space; or else it is merely retrieving candidate proposals.
It would be interesting to see if we can extract or model the exact algorithms from the activations, or whether it is simply retrieving and proposing deductive closures of said computation.
In the latter case, it would mean that LLMs alone can never "reason" and you need an external planner+verifier (alpha-evolve style evolutionary planner for example).
We are still looking for proof of the former behaviour.
It's deeply surprising to me that LLMs have had more success proving higher math theorems than making successful consumer software
Software developers have spent decades at this point discounting and ignoring almost all objective metrics for software quality and the industry as a whole has developed a general disregard for any metric that isn't time-to-ship (and even there they will ignore faster alternatives in favor of hyped choices).
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about; time-to-ship. The results are just a sped up version of what we've had now for more than two decades: Software is getting slower, buggier and less usable.
If you don't have a good regulating function for what represents real quality you can't really expect systems that just pump out code to actually iterate very well on anything. There are very few forcing functions to use to produce high quality results though iteration.
But we don't even seem to be getting faster time-to-ship in any way that anybody can actually measure; it's always some vague sense of "we're so much more productive".
1 reply →
This doesn't pass a sniff test. We have plenty of ways to verify good software, else you wouldn't be making this post. You know what bad software is and looks like. We want something fast that doesn't throw an error every 3 page navigations.
You can ask an LLM to make code in whatever language you want. And it can be pretty good at writing efficient code, too. Nothing about NPM bloat is keeping you from making a lean website. And AI could theoretically be great at testing all parts of a website, benchmarking speeds, trying different viewports etc.
But unfortunately we are still on the LLM train. It just doesn't have anything built-in to do what we do, which is use an app and intuitively understand "oh this is shit." And even if you could allow your LLM to click through the site, it would be shit at matching visual problems to actual code. You can forget about LLMs for true frontend work for a few years.
And they are just increasingly worse with more context, so any non-trivial application is going to lead to a lot of strange broken artifacts, because text prediction isn't great when you have numerous hidden rules in your application.
So as much as I like a good laugh at failing software, I don't think you can blame shippers for this one. LLMs are not struggling in software development because they are averaging a lot of crap code, it's because we have not gotten them past unit tests and verifying output in the terminal yet.
They haven't, not at all as far as I can tell. This math problem appears to be a nice chore to be solved, the equivalent to "Claude, optimize this code" or "Write a parser", which is being done 100000x a day.
The original researchers who proposed this problem tried and failed multiple times to solve it. Does that sound like a 'nice chore to be solved' to you ?
15 replies →
But the title claims it is a "frontier" math problem, so which is it really.
Pretty much all consumer software made in 2026 is heavily using AI in its development. So I'm not sure what basis you have for your assertion.
There seems to be a focus on understanding when talking about LLMs and solving problems. Personally, I do not think understanding is required. I can write a very small program that can calculate Pi to however many digits I like, or calculate any digit in the sequence on demand, without the program or computer having any understanding at all of what Pi is or what it means. I could get Claude to output that same code when prompted to find a solution to generating Pi, also with no understanding of what Pi is, or what it means.
IMO the ability to provide an accurate solution to a problem is not always based on understanding the problem.
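The "very small program" claim holds up. One such sketch, using Gibbons' unbounded spigot algorithm, streams decimal digits of pi from pure integer arithmetic, with no notion anywhere of what pi means:

```python
# Gibbons' unbounded spigot algorithm: produces decimal digits of pi
# using only integer arithmetic. Nothing here "understands" pi.
def pi_digits(n):
    """Return the first n decimal digits of pi as a list of ints."""
    digits = []
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while len(digits) < n:
        if 4 * q + r - t < m * t:
            # The next digit m is now pinned down; emit it and rescale.
            digits.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            # Consume one more term of the underlying series.
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return digits

print(pi_digits(10))  # → [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```

The state variables are just bookkeeping for a linear-fraction transformation; the program computes correct digits forever without representing circles, limits, or anything pi "is about".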
Impressive, but it will take away so much sense of accomplishment from so many people. I find that really sad.
No denial at this point, AI could produce something novel, and they will be doing more of this moving forward.
Not sure if AI can have clever or new ideas; it still seems to me that it combines existing knowledge and executes algorithms.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
1 reply →
We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.
Clever/novel ideas are very often subtle deviations from known, existing work.
Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.
There is no such thing. All new ideas are derived from previous experiences and concepts.
2 replies →
"extrapolation" literally implies outside the extents of current knowledge.
1 reply →
How would you know if it wasn't an extrapolation of current knowledge? Can you point me to somethings humans have done which isn't an extrapolation?
1 reply →
[flagged]
Your analogy falls apart if we consider the number wasn't on the clock face.
5 replies →
I mean, I can run a pseudo random number generator, and produce something novel too.
Is this novel? It's new. But we already know AI can generate new things, any statistical reassembly of any content will generate new things.
It's not to downplay this, but it's unclear what "novel" means here or what you think the implications are.
Seems like the high-compute parallel-thinking models weren't even needed; both the normal GPT-5.4 and Gemini 3.1 Pro solved it. Somehow Gemini 3 Deep Think couldn't solve it.
Is their scaffold available? Does it do anything special beyond feeding the warmup, single challenge, and full problem to an LLM? Because it's interesting that GPT-5.2 Pro, arguably the best model until a few months ago, couldn't even solve the warmup. And now every frontier model can solve the full problem. Even the non-Pro GPT-5.4. Also strange that Gemini 3 Deep Think couldn't solve it, whereas Gemini 3.1 Pro could. I read that Deep Think is based on 3.1 Pro. Is that correct?
I see that GPT-5.2 Pro and Gemini 3 Deep Think simply had the problems entered into the prompt. Whereas the rest of the models had a decent amount of context, tips, and ideas prefaced to the problem. Were the newer models not able to solve this problem without that help?
Anyway, impressive result regardless of whether previous models could've also solved it and whether the extra context was necessary.
I know these frontier models behave differently from each other. I wonder how many problems they could solve combining efforts.
I don't understand the position that learning through inference/example is somehow inferior to top-down, rules-based learning.
Humans learn many, and perhaps even the majority, of things through observed examples and inference of the "rules". Not from primers and top down explanation.
E.g. Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Or: Observing a game being played to form an understanding of the rules, rather than reading the rulebook
Further: the majority of "novel" insights are simply the combination of existing ideas.
Look at any new invention, music, art etc and you can almost always reasonably explain how the creator reached that endpoint. Even if it is a particularly novel combination of existing concepts.
What are the odds that this is because Openai is pouring more money into high publicity stunts like this- rather than its model actually being better than Anthropics?
Do they also publish the raw output of the model, i.e. not only the final response but also everything generated for internal reasoning or tool use?
Reading this thread I'm reassured that despite everything AI may disrupt, humans arguing past each other about philosophy of knowledge and epistemology on internet forums is safe :')
Domain-experienced users are effectively training LLMs to mimic themselves in solving their problems, thereby solving their problems via chat-data concentration.
Beside the point of the supposed achievement (which is supposedly confirmed): my point is that Epoch.ai is possibly just a PR firm for *Western* AI providers, in which case this news is possibly untrustworthy.
Fantastic and exciting stuff!
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
I guess this means AI researchers should be out of jobs very soon.
Is it a coincidence that the first open problems solved by an LLM and a 4chan thread would be in the same field?
I feel like there’s a fork in our future approaching where we’ll either blossom into a paradise for all or live under the thumb of like 5 immortal VCs
Change is always hard, even if it will be good in 20 years, the transitions are always tough.
Sometimes the transition is tough and then the end state is also worse!
Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.
Been a long three years since single digit addition was a serious challenge for even top tier models
This is a lot like the old saw that 50 million monkeys on 50 million typewriters will eventually write Shakespeare. We have all heard it; pity the poor proofreaders who would have to check it all in the search for the holy grail of zero errors.
In a similar way, LLMs are permutational cross-associating engines, matched with sieves to filter out the dross. Less filtering means more dross, AKA slop. They can certainly create enormous masses of bad code, and with well-tuned screens for dross they can create passable code, but stray flaws (flies) can creep in and not get filtered, and humans are better at seeing flies in their oatmeal.
AI also seems very good at mounting permutational assaults on masses of code to find the flies (zero-days), so I expect it to make code more secure, since few humans have the ability or time to mount that sort of assault on code bases. This idea has already taken root among code writers as well as hackers/China etc. These two opposing forces will assault code bases, one to break and one to fortify. In time there will be fewer places where code bases have hidden flaws, as soon all new code will be screened by AI to find breaks, so that little or no code will contain these bugs.
> This is a lot like the 50 million monkeys on 50 million typewriters will eventually write shakespeare...
"Eventually" here is something on the order of a few expected lifespans of the universe.
The fact that we're getting meaningful results out of LLMs on a human timescale means that they're doing something very different.
Yes, the space is indeed deep/wide, but LLMs probably cull the herd as they proceed so they eliminate swathes as they go. Smart fuzzing in a way.
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
But who asked the model to solve that problem?
This is impressive, but OpenAI is still shit as a company. How dare they even have "open" in their company name.
First prove the solution wasn’t in the training data. Otherwise it’s all just vibes and ‘trust me bro.’
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
wow nice
Really? He was steering the wheel the whole time. GPT didn't do the math.
Ah, the good old Clever Hans. https://en.wikipedia.org/wiki/Clever_Hans
It's an article from an AI site. People with vested interests are desperate to prove it's not an expensive parrot.
We only get one shot.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
A model to whose internals we don't have access solved a problem that, for all we know, was in its training data. Great, I'm impressed.