Comment by qnleigh

5 days ago

I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.

> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas

> it's not because the model is figuring out something new

> LLMs will NEVER be able to do that, because it doesn't exist

It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.

If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?

I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:

1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.

However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:

1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the Math Olympiad.

2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

  • I studied philosophy focusing on the analytic school and proto-computer science. LLMs are going to force many people to start developing a better understanding of what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.

    Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there is some room to catch up on previously recorded experimental outcomes.

    Having a focus in philosophy of language is something I genuinely never thought would be useful. It's really been helpful with LLMs, but probably not in the way most people think. I'd say that curious folks should all be reading Quine, Wittgenstein's Investigations, and probably Austin.

    • I think we may have similar perspectives. Regarding empirical knowledge, consider the case where the knowledge concerns chaotic systems. Characterize chaotic systems, at minimum, as systems where imprecise observations of the past and present state, while useful for predicting the future, nevertheless see their errors grow very quickly as you try to predict a future state. Then indeed, prediction is difficult.

      There is one domain of knowledge I think you have yet to mention: fundamentally computationally hard problems. The examples of practical benefit that come to mind are physics simulations, material simulations, and fluid simulations, though there exist problems that are more provably computationally difficult. It seems to me that with these systems, the chaotic nature means that even if you have one infinitely precise observation of a deterministic system, computing a future state of the system is still difficult, even though once computed, memorizing it seems comparatively trivial.

    • Where can I read about how LLMs have changed epistemology? Is there a field of philosophy that tries to define and understand 'intelligence'? That sounds very interesting.

    • Also, we can do thought experiments, simulations in our heads, that are often nearly as good as doing them for real; the approach has limitations and isn't perfect, but it does work often. Einstein reportedly used to doze off on purpose in an awkward position so that something would hit his leg, or something like that, slightly nudging him half awake so he could remember his half-dreaming state, which is where he discovered some things.

    • > Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms.

      Not really; the normal way that math progresses, just like everything else, is that you get some interesting results, and then you develop the theoretical framework. We didn't receive the axioms; we developed them from the results that we use them to prove.

    • > distinction between deductive and inductive knowledge

      There's also intuitive knowledge btw.

      Anyway, the recent developments in AI make a lot of very interesting things practically possible. For example, our society is going to want a way to reliably tell whether something is AI generated, and a failure to do so pretty much settles the empirical part of the Turing test issue. Or alternatively, if we actually find something human that AI can't reliably mimic, that's going to be a huge finding. With millions of people wondering whether posts on social media are AI generated, we have inadvertently conducted the largest-scale Turing test ever.

      The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting. If humans are not bogged down by the small details or cost of implementation concerns, and we can just say what we want and get what we wished for (digitally), what level of creativity can we reach?

      Also once we get the robots to do things in the physical space...

  • There are ways to go beyond the human-quality data limitation. AI can be trained on data of better quality than the average human's output, because for many problems the solutions are easy to verify. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at it.

    It's also possible that there can be emergent capabilities. Perhaps a little obtuse, but you can say that humans are trained on human-quality data too and yet brilliant scientists and creative minds can rise above the rest of us.

  • > Their capabilities should saturate at human or maybe above-average human performance

    LLMs do have superhuman reasoning speed and superhuman dedication. Speed is something you can scale, and at some point quantity can turn into quality. Much of the frontier work done by humans is just dedication, luck, and remixing other people's ideas ("standing on the shoulders of giants"), isn't it? All of this is exactly what you can scale by having restless hordes of fast-thinking agents, even if each of those agents is intellectually "just above average human".

  • > 1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

    Why oh why is this such a commonly held belief. RL in verifiable domains being the way around this is the entire point. It’s the same idea behind a system like AlphaGo — human data is used only to get to a starting point for RL. RL will then take you to superhuman performance. I’m so confused why people miss this. The burden of proof is on people who claim that we will hit some sort of performance wall because I know of absolutely zero mechanisms for this to happen in verifiable domains.
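
    A minimal sketch of what RL in a verifiable domain amounts to: an automatic grader turns an easy-to-check task (toy arithmetic here, standing in for unit tests or competition judges) into a reward signal. The function name and reward scheme are illustrative assumptions, not any lab's actual setup.

    ```python
    def grade(problem: str, model_answer: str) -> float:
        """Return 1.0 only if the model's answer matches the checkable ground truth."""
        a, b = map(int, problem.split("*"))
        try:
            return 1.0 if int(model_answer.strip()) == a * b else 0.0
        except ValueError:
            return 0.0  # unparseable answers earn no reward

    # In an RL loop, rewards from a verifier like this (or a unit-test runner for
    # code) replace human labels, so quality is not capped by the training corpus.
    print(grade("167383*426397", "71371609051"))  # -> 1.0
    ```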

  • The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.

LLMs can generate anything by design. LLMs can't understand what they are generating, so it may be true, it may be wrong, it may be novel or it may be a known thing. It doesn't discern between them, just looks for the best statistical fit.

The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving unsolved math problem" a certain meaning in our heads. Some of us at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug etc. Something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables into a drawer, or a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. Which are formally novel things, but we really never needed any of that and so relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for the solutions to many such problems, but they are not that impressive in the first place.

  • > It doesn't discern between them, just looks for the best statistical fit

    Of course at the lowest level, LLMs are trained on next-token prediction, and on the surface, that looks like a statistics problem. But this is an incredibly reductionist viewpoint and I don't see how it makes any empirically testable predictions about their limits. LLMs 'learned' a lot of math and science in this way.

    > "truly novel" and "solving unsolved math problem"

    OK again if novelty lies on a continuum, where do you draw the line? And why is it correct to draw it there and not somewhere else? It seems like you are just naming exceptionally hard research problems.

  • If LLMs can come up with formally novel solutions to things, and you have a verification loop to ensure that they are actual, proper solutions, I don't understand why you think they could never come up with solutions to impressive problems, especially considering the thread we are literally on right now. At this point, it seems like a pure assertion that they will always be limited to coming up with truly novel solutions to uninteresting problems.

    • It probably can, but it won't realize that, and it won't be efficient at it. An LLM can shuffle tokens for an enormous number of tries and eventually come up with something super impressive, though as you yourself have mentioned, we would need a mandatory verification loop to filter slop from good output, and how to build one outside of some limited areas is a big question. But assume we have these verification loops and are running LLMs for years to look for something novel. It's like running the energy grid of a small country to change a few dozen database entries per hour. Yes, we can do that, but it's a kind of weird thing to do. But it is novel, no argument about that. Just inefficient.

      We never had a big demand to define what makes humans intelligent or conscious, etc., since it is too hard and was relegated to some frontier researchers. With LLMs we now do have such demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to work out whether we are different from these programs and, if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) has not yet been reached.

  • > It doesn't discern between them, just looks for the best statistical fit.

    Why is this not true for humans?

    • We can't tell yet whether that is true, partially true, or false for humans. We do know that an LLM can't do anything else besides that (I mean as a fundamental operating principle).

  • > Which are formally novel things, but we really never needed any of that

    The history of science and maths is littered with seemingly useless discoveries being pivotal as people realised how they could be applied.

    It's impossible to tell what we really "need"

  • > LLMs can't understand what they are generating

    You don't understand what "understanding" means. I'm sure you can't explain it. You are probably just hallucinating the feeling of understanding it.

    > Some of us at least, think that truly novel means something truly novel and important, something significant. Like, I don't know...

    Yeah.

LLMs are notoriously terrible at multiplying large numbers: https://claude.ai/share/538f7dca-1c4e-4b51-b887-8eaaf7e6c7d3

> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631

Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939

> 2 165 842 392 930 662 831

Your example seems short enough to not pose a problem.
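
For what it's worth, the gap is easy to check in any environment with big-integer arithmetic (Python ints are arbitrary precision); a quick verification:

```python
# Exact integer arithmetic confirms the Wolfram Alpha value.
print(729_278_429 * 2_969_842_939)                               # 2165842392930662831
print(2_165_878_555_365_498_631 - 729_278_429 * 2_969_842_939)   # Claude's answer is off by 36162434835800
```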

  • Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a Python script can, so that's what any tool-using LLM will (and should) do.

    • This is special pleading.

      Long multiplication is a trivial form of reasoning that is taught at the elementary level. Furthermore, the LLM isn't doing things "in its head": the headline feature of GPT-style LLMs is attention across all previous tokens, so all of its "thoughts" are on paper. That was Opus with extended reasoning; it had every opportunity to get it right, but didn't. There are people who can quickly multiply such numbers in their heads (I am not one of them).

      LLMs don't reason.

  • This hasn't been true for a while now.

    I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.

    Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.
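
    For reference, the casting-out-nines check mentioned above is easy to reproduce; a quick sketch (digit sums mod 9 must agree, a necessary but not sufficient condition for a correct product):

    ```python
    def mod9(n: int) -> int:
        # Casting out nines: a number is congruent to its digit sum mod 9.
        return sum(int(d) for d in str(n)) % 9

    a, b = 729_278_429, 2_969_842_939
    expected = (mod9(a) * mod9(b)) % 9
    print(expected == mod9(2_165_878_555_365_498_631))  # False: the wrong answer upthread fails the check
    print(expected == mod9(2_165_842_392_930_662_831))  # True: consistent with the correct answer
    ```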

  • This doesn’t address the author’s point about novelty at all. You don’t need 100% accuracy to have the capability to solve novel problems.

  • I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.
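
    For concreteness, the long-form procedure in question is just the schoolbook algorithm, one shifted partial product per digit; a minimal sketch of the intermediate work the prompt was asking for (not a claim about what the model does internally):

    ```python
    def long_multiply(a: int, b: int) -> int:
        # Schoolbook multiplication: accumulate a * digit * 10**place for each digit of b.
        total = 0
        for place, digit in enumerate(reversed(str(b))):
            total += a * int(digit) * 10**place
        return total

    assert long_multiply(729_278_429, 2_969_842_939) == 729_278_429 * 2_969_842_939
    ```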

    • Other comments indicate that asking it to do long multiplication does work, but the varying results make sense: LLMs are probabilistic, so you probably rolled an unlikely result.

I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).

I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.

It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.

This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.

[1]: https://imgur.com/a/gWTGGYa

  • That's actually pretty cool. What made you think of doing this in the first place?

    • Thanks! I've been doing a lot of work on a laptop screen (I normally work on an ultrawide) and got tired of constantly switching between windows to find the information I need.

      I've also added the ability to create a picture-in-picture section of any application window, so you can move a window to the background while still seeing its important content.

      I'll probably do a Show HN at some point.

  • Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot of stuff that's novel for me would be something someone has been doing for decades somewhere.

    • Unless you worked on the macOS content server directly you’d have no idea that my solution was even possible.

      The fact that Claude skipped over all the obvious solutions is why I used the word novel.

  • Why is ScreenCaptureKit a bad choice for performance?

    • Because you can't control what the content server is doing. SCK doesn't care if you only need a small section of a window: it performs multiple full window memory copies that aren't a problem for normal screen recorders... but for a utility like mine, the user needs to see the updated content in milliseconds.

      Also, as I mentioned above, when using SCK, the user cannot minimize or maximize any "watched" window, which is, in most cases, a deal-breaker.

      My solution runs at under 2% CPU utilization because I don't have to first receive the full window content. SCK was not designed for this use case at all.

  • What was the solution?

    • Well, I'm not going to share either solution as this is actually a pretty useful utility that I plan on releasing, but the short answer is: 1) don't use ScreenCaptureKit, and 2) take advantage of what CGWindowListCreateImage() offers through the content server. This is a simple IPC mechanism that does not trigger all the SCK limitations (i.e., no multi-space or multi-desktop support). In fact, when using SCK, the user cannot even minimize the "watched" window.

      Claude realized those issues right from the start.

      One of the trickiest parts is tracking the window content while the window is moving - the content server doesn't, natively, provide that information.
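
      For readers curious what the non-SCK path looks like in general, here is a minimal sketch of single-window capture through CGWindowListCreateImage via pyobjc's Quartz bindings. It only illustrates the API named above, not the author's implementation; screen-recording permission still applies, and the call is deprecated on recent macOS releases.

      ```python
      import Quartz  # pyobjc bindings for CoreGraphics / the window server

      def capture_window(window_id):
          # Ask the window server for an image of just this window - no
          # ScreenCaptureKit stream, no full-display copy.
          return Quartz.CGWindowListCreateImage(
              Quartz.CGRectNull,                           # use the window's own bounds
              Quartz.kCGWindowListOptionIncludingWindow,   # only the target window
              window_id,
              Quartz.kCGWindowImageBoundsIgnoreFraming,    # drop shadow/frame padding
          )

      # Enumerate on-screen windows to find an ID worth watching.
      for w in Quartz.CGWindowListCopyWindowInfo(
              Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID):
          print(w.get("kCGWindowNumber"), w.get("kCGWindowOwnerName"))
      ```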

>>AI is a remixer; it remixes all known ideas together. It won't come up with new ideas

I always found this argument very weak. There isn't that much truly new anyway. Creativity is often about mixing old ideas. Computers can do that faster than humans if they have a good framework. Especially with something as simple as math, with its limited set of formal rules and easy-to-verify results, I find the belief that computers won't beat humans at it very naive.

> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.

Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is because it saw many multiplication problems in its training data, and it was cheaper to learn the compressed/abstract multiplication strategies and encode them as circuits in the network, rather than memorize the times tables up to some large N. This gives it the ability to approximate multiplication problems it hasn't seen before.

  • > This gives it the ability to approximate multiplication problems it hasn't seen before.

    More than approximate. It straight up knows the algorithms and will do arbitrarily long multiplications correctly. (Within reason. Obviously it couldn't do a multiplication so large the reasoning tokens would exceed its context window.)

    I had ChatGPT 5.4 do 1566168165163321561 * 115616131811365737 without tools; after multiplying out a lot of coefficients, it eventually answered 181074305022287409585376614708755457, which is correct.

    At this point, it's less misleading to say it knows the algorithm.

  • Yup, I agree with this. So based on this, where do you draw the line between what will be possible and what will not be possible?

  • Why are we reducing AIs to LLMs?

    Claude, OpenAI, etc.'s AIs are not just LLMs. If you ask it to multiply something, it's going to call a math library. Go feed it a thousand arithmetic problems and it'll get them 100% right.

    The major AIs are a lot more than just LLMs. They have access to all sorts of systems they can call on. They can write code and execute it to get answers. Etc.

  • Which is exactly how humans learn many things too.

    E.g. observing a game being played to form an understanding of the rules, rather than reading the rulebook

    Or: Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.

Most inventions are an interpolation of three existing ideas. These systems are very good at that.

  • My take as well. Furthermore, most innovations come relatively shortly after their technological prerequisites have been met, so that suggests the "novelty space" that humans generally explore is a relatively narrow band around the current frontier. Just as humans can search through this space, so too should machines be capable of it. It's not an infinitely unbounded search which humans are guided through by some manner of mystic soul or other supernatural forces.

  • Indeed. Every time someone complains that LLMs can't come up with anything new, I'm assaulted with the depressing remembrance that neither do I.

Beliefs are not rooted in facts. Beliefs are a part of you, and people aren't all that happy to say "this LLM is better than me"

  • I'm very happy to say calculators are far better than me at calculations (to a given precision). I'm happy to admit computers are so much better than me in so many aspects. And I have no problem saying LLMs are very helpful tools able to generate output so much better than mine in almost every field of knowledge.

    Yet, whenever I ask it to do something novel or creative, it falls very short. But humans are ingenious beasts, and I'm sure sooner or later they will design an architecture able to be creative; I just doubt it will be Transformer-based, given the results so far.

    • But the question isn't whether you can get LLMs to do something novel, it's whether anyone can get them to do something novel. Apparently someone can, and the fact that you can't doesn't mean LLMs aren't good for that.

I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.

It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.

A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:

1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).

2. Provide evidence that supports your theory and ideally cannot be just as well accounted for by another theory.

I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well, I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well, I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previously published works, potentially informal comments on some forum, etc., that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study; the authors of this don't even have access to the training data, and we'd need quite a bit more than this to form assumptions.

I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can reduce to statistical token generation; it just strikes me as unintuitive, so I'm going to need to hear something to compel me.

[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...

  • > I think "novel" is ill defined here

    That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.

    As for evidence that they can do novel things, there is plenty:

    1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.

    2. SVGs of pelicans riding bicycles

    3. People use LLMs to write new apps/code every day

    4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available

    5. LLMs have solved open problems in physics and mathematics [0,1]

    That is as far as they have advanced so far. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).

    [0] https://news.ycombinator.com/item?id=47497757

  • The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?

    Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

    • > The “good deal of evidence” is everywhere. The proof is in the pudding.

      I'm open! Please, by all means.

      > the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?

      The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

      > Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.

      Okay.

      > That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).

      This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

      As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

      > It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

      Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

      So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

Ximm's Law applies ITT: every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.

Especially the lemmas:

- any statement about AI which uses the word "never" to preclude some feature from future realization is false.

- contemporary implementations have almost always already been improved upon, but are unevenly distributed.

  • Anti-Ximm's Law: every response to a critique of AI assumes as much arbitrary level of future improvement as is necessary to make the case.

It is like not trusting someone who attained the highest score on some exam by learning the whole textbook by heart to do the corresponding job.

Not very hard to understand.

> asserting that LLMs will never generate 'truly novel' ideas or problem solutions

I don't think I've had one of these my entire life. Truly novel ideas are exceptionally rare:

- Darwin's On the Origin of Species
- Gödel's incompleteness theorems
- Buddhist detachment

Can't think of many.

People rarely create things that are wholly new.

Most created things are remixes of existing things.

Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.

When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well-defined problem and just loop through it.

I think you're arguing with semantics.

Do we know for a fact that LLMs aren't now configured to pass simple arithmetic like this to a simple calculator, to add the illusion of actual insight?

  • The major AIs have access to all sorts of tools, including a math library. I thought this was well-known. There's no "illusion of actual insight" - they're just "using a calculator" (in the sense that they call a math library when needed). AIs are not just LLMs.

  • You can train an LLM on just multiplication and test it on problems it has never seen before; it's nothing particularly magical.
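
    A rough sketch of what that setup could look like, with synthetic pairs and a held-out test split (the sizes and prompt format are arbitrary assumptions for illustration):

    ```python
    import random

    def make_pair(rng, digits=6):
        # One example: "a*b=" as the prompt, the exact product as the target.
        a = rng.randrange(10**(digits - 1), 10**digits)
        b = rng.randrange(10**(digits - 1), 10**digits)
        return f"{a}*{b}=", str(a * b)

    def make_split(n_train=100_000, n_test=1_000, seed=0):
        rng = random.Random(seed)
        train = {make_pair(rng) for _ in range(n_train)}
        test = set()
        while len(test) < n_test:   # test problems never appear in training
            pair = make_pair(rng)
            if pair not in train:
                test.add(pair)
        return sorted(train), sorted(test)
    ```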

    • It's not 'magic', but previously LLMs have performed very badly on longer multiplication. 'Insight' is the wrong word, but I'm saying maybe they're not wildly better at this calculation... maybe they are just optimising for these well-known jagged edges.

> You need to say why it can do some novel tasks but could never do others.

This is actually quite a tall order. Reasoning about AI and making sense of what the LLMs are doing, and learning to think about it as technology, is a very difficult and very tricky problem.

You get into all kinds of weird things about a person’s outlook on life: personal philosophy, understanding of ontology and cosmology, and then whatever other headcanon they happen to be carrying around about how they think life works.

I know that might sound kind of poetic, but I really believe it’s true.

I am a great fan of Dr Richard Hamming and he gave a wonderful series of lectures on the topic. The book Learning to Learn has the full set of his lectures transcribed (highly recommend this book!).

But don't take my word for it, listen to Dr Hamming say it himself: https://www.youtube.com/watch?v=aq_PLEQ9YzI

"The biggest problem is your ego. The second biggest problem is your religion."

Yes! I call these the "it's just a stochastic parrot" crowd.

Ironically, they are the stochastic parrots, because they're confidently repeating something that they read somewhere and haven't examined critically.

I guess when it can't be tripped up by simple things like multiplying numbers, counting to 100 sequentially or counting letters in a string without writing a python program, then I might believe it.

Also no matter how many math problems it solves it still gets lost in a codebase

  • LLMs are bad at arithmetic and counting by design. It's an intentional tradeoff that makes them better at language and reasoning tasks.

    If anybody really wanted a model that could multiply and count letters in words, they could just train one with a tokenizer and training data suited to those tasks. And the model would then be able to count letters, but it would be bad at things like translation and programming - the stuff people actually use LLMs for. So, people train with a tokenizer and training data suited to those tasks; hence LLMs are good at language and bad at arithmetic.
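
    To make the tokenizer point concrete, here is a quick illustration, assuming OpenAI's tiktoken package and its cl100k_base vocabulary (other models chunk text differently):

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["729278429", "strawberry"]:
        # Numbers and words are split into multi-character chunks, so the model
        # never directly "sees" individual digits or letters.
        print(text, "->", [enc.decode([t]) for t in enc.encode(text)])
    ```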

  • Arguments like "but AI cannot reliably multiply numbers" fundamentally misunderstand how AI works. AI cannot do basic math not because AI is stupid, but because basic math is an inherently difficult task for otherwise smart AI. Lots of human adults can do complex abstract thinking but when you ask them to count it's "one... two... three... five... wait I got lost".

    • > fundamentally misunderstand how AI works

      Who does fundamentally understand how LLMs work? Many claims flying around these days, all backed by some of the largest investments ever collectively made by humans. Lots of money to be lost because of fundamental misunderstandings.

      Personally, I find that AI influencers conveniently brush away any evidence (like inability to perform basic arithmetic) about how LLMs fundamentally work as something that should be ignored in favor of results like TFA.

      Do LLMs have utility? Undoubtedly. But it’s a giant red flag for me that their fundamental limitations, of which there are many, are verboten to be spoken about.

Ok, I'll bite. Show me an LLM that comes up with a new math operator. Or one that will come up with the theory of relativity if only Newtonian physics is in its training dataset. That it can remix existing ideas in ways that lead to novel insights is expected; however, the current LLMs can't come up with paradigm shifts that require novel insights. Even humans have a rather limited window in which they can come up with novel insights (when they are young, capable of latent thinking, not yet ossified by the existing formalization of science, and their brain is still energetically capable, without the vascular and mitochondrial dysfunction common as we age).

  • How many humans have been born until now and how many Einsteins have been born? And in how many hundreds of thousands of years?

    • The point is that humans do have some edge compared to current LLMs, which are essentially next-token predictors. If we all start relying on current AI and stop thinking, we would only be able to "exhaust the remix space" of existing ideas but wouldn't be able to make any paradigm jumps. Moreover, it's quite likely that current training sets are self-contradictory, containing Dutch books and carrying some innate errors.
