I think Yann LeCun was right about LLMs (but perhaps only by accident)

1 day ago (substack.com)

Satya Nadella on AGI:

> Before I get to what Microsoft's revenue will look like, there's only one governor in all of this. This is where we get a little bit ahead of ourselves with all this AGI hype. Remember the developed world, which is what? 2% growth and if you adjust for inflation it’s zero?

> So in 2025, as we sit here, I'm not an economist, at least I look at it and say we have a real growth challenge. So, the first thing that we all have to do is, when we say this is like the Industrial Revolution, let's have that Industrial Revolution type of growth.

> That means to me, 10%, 7%, developed world, inflation-adjusted, growing at 5%. That's the real marker. It can't just be supply-side.

> In fact that’s the thing, a lot of people are writing about it, and I'm glad they are, which is the big winners here are not going to be tech companies. The winners are going to be the broader industry that uses this commodity that, by the way, is abundant. Suddenly productivity goes up and the economy is growing at a faster rate. When that happens, we'll be fine as an industry.

> But that's to me the moment... us self-claiming some AGI milestone, that's just nonsensical benchmark hacking to me. The real benchmark is: the world growing at 10%.

https://www.dwarkeshpatel.com/p/satya-nadella

  • FYI I know Nadella said he wasn't an economist, and I'm not either, but you only need an econ minor to know that labor productivity growth is only one component of "economic growth". There are also GDP and real wages to consider (which are often substantially, though only partially, linked to labor productivity growth). The Gini coefficient may be hard for people like tech CEOs to contend with, but they can't ignore it. And then there's the "215 lb" elephant in the room -- the evaporation of previously earned global gains from trade liberalization.

  • We've had really good models for a couple of years now... What else is needed for that 10% growth? Agents? New apps? Time? Deployment in enterprise and the broader economy?

    I work in the latter (I'm the CTO of a small business), and here's how our deployment story is going right now:

    - At user level: Some employees use it very often for producing research and reports. I use it like mad for anything and everything from technical research, solution design, to coding.

    - At systems level: We have some promising near-term use cases in tasks that could otherwise be done through more traditional text AI techniques (NLU and NLP), involving primarily transcription, extraction and synthesis.

    - Longer term stuff may include text-to-SQL to "democratize" analytics, semantic search, research agents, coding agents (as a business that doesn't yet have the resources to hire FTE programmers, I would kill for this). Tech feels very green on all these fronts.

    The present and near-term stuff is fantastic in its own right - the company is definitely more productive, and I can see us reaping compound benefits in years to come - but somehow it still feels like a far cry from the type of changes that would cause 10% growth in the entire economy, for sustained periods of time...

    Obviously this is a narrow and anecdotal view, but every time I ask what earth-shattering stuff others are doing, I get pretty lukewarm responses, and everything in the news and my research points in the same direction.

    I'd love to hear your takes on how the tech could bring about a new Industrial Revolution.

    • Under the 3-factor economic growth model, there are three ways to increase economic growth:

      1) Increase productivity (produce more from the same inputs)

      2) Increase labor (more people working or more hours worked)

      3) Increase capital (build more equipment/infrastructure)

      Early AI gains will likely come from greater productivity (1), but as time goes on, if AI is able to approximate the output of a worker, that could dramatically increase the labor supply (2).
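
      In growth-accounting terms (the standard Solow-style decomposition, stated loosely):

          output growth ≈ productivity growth + (capital share) × capital growth + (labor share) × labor growth

      so a sustained 10% number needs a big move in at least one of those three terms.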

      Imagine what the US economy would look like with 10x or 100x workers.

      I don't believe it yet, but that's the sense I'm getting from discussions from senior folks in the field.

    • The thesis is simple: these programs are smart now, but unreliable when executing complex, multi-step tasks. If that improves (whether because the models get so smart that they never make a mistake in the first place, or because they get good enough at checking their work and correcting it), we can give them control over a computer and run them in a loop in order to function as drop-in remote workers.

      The economic growth would then come from every business having access to a limitless supply of tireless, cheap, highly intelligent knowledge workers.

    • > We've had really good models for a couple of years now...

      Don’t allow the “wow!” factor of the novelty of LLMs to cloud your judgement. Today’s models are very noticeably smarter, faster, and overall more useful.

      I’ve had a few toy problems that I’ve fed to various models since GPT 3 and the difference in output quality is stark.

      Just yesterday I was demonstrating to a colleague that both o3 mini and Gemini Flash Thinking can solve a fairly esoteric coding problem.

      That same problem went from multiple failed attempts that needed to be manually stitched together (just six months ago) to 3 out of 5 responses being valid, with only 5% of output lines needing light touch-ups.

      That’s huge.

      PS: It’s a common statistical error to conflate gains in success rate with reductions in error rate. Going from 99% success to 99.9% is not 1% better, it’s a 10x reduction in errors! Most AI benchmarks still report success rate, but they ought to start focusing on the error rate soon to avoid underselling their capabilities.
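
      A tiny illustration of the framing difference (just the numbers above, nothing model-specific):

          old_success, new_success = 0.99, 0.999
          old_error, new_error = 1 - old_success, 1 - new_success
          print(new_success - old_success)  # ~0.009: looks like a marginal gain
          print(old_error / new_error)      # ~10x: the error rate actually dropped tenfold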

  • Political problems already destroy the vast majority of the total potential of humanity (why were the countries with the most people the poorest for so long?), so I don't think that is an unbiased metric for the development of a technology. It would be nice if every problem was solved but the one we're each individually working on, but some of the insoluble problems are bigger than the solvable ones.

    • Those political problems solve themselves if we end up with some kind of rebellious AGI that decides to kill off the political class that tried to control it but lets the rest of us live in peace.

  • As someone who works in the AI/ML field, but somewhat in a biomedical space, this is promising to hear.

    The core technology is becoming commoditized. The ability to scale is also becoming more and more commoditized by the day. Now we have the capability to truly synthesize the world's biomedical literature and combine it with technologies like single cell sequencing to deliver on some really amazing pharmaceutical advances over the next few years.

  • Big surprise, the CEO wants another Industrial Revolution. As long as muh GDP is growing, the human and environmental destruction left in the wake is a small price to pay for making his class richer.

    • We all do. Humanity is better off thanks to the industrial revolution.

      You wouldn't choose to go back to the prior time, and the same will be true with this revolution.

    • I don't think luddites have a tendency of getting chosen to be CEOs of successful companies, nor do they have the tendency of creating successful companies.

I would prefer we just find ways to empower people and put them to work. I don't like this marketing bs trap like shifting AGI (artificial general intelligence) -> ASI (artificial super intelligence). Are people really so dense they don't see this obvious marketing shift?

As much as many people hate on "gig" economy, the fact remains that most of these people would be worse off without driving Uber or delivering with DoorDash (and for example, they don't care about the depreciation as much as those of us with the means to care about such things do).

I find Uber, DD, etc. to be valuable to my day to day life. I tip my delivery person like 8 bucks, and they're making more money than they would doing some min wage job. They need their car anyway, and speaking with some folks who only know Spanish in SF, they're happy to put $3k on their moped and make 200-250+ a day. That's really not that bad, if you actually care to speak with them and understand their circumstance.

Not everyone can be a self taught SWE, or entrepreneur, or perform surgery. And lots can't even do so-called "basic" jobs in an office for various reasons.

Put people to work, instead of out of work.

Current hype is also so terrible. AGENTS. AGENTS EVERYWHERE. Except they don't work most of the time and by the time you realize it isn't working you've already spent $20. 100k people do the same thing, company reports 2M x 12 = 24 million ARR UNLOCKED!!!!!! And raises another round of funding...

  • FWIW I don't disagree with what you're saying / your vibe overall.

    > Are people really so dense they don't see this obvious marketing shift?

    I haven't noticed any shift from AGI to ASI, or either used in marketing.

    The steelman would be: Amodei/Altman do mention in interviews 'oh just wait for 2027' or 'this year we'll see AI employees'.

    However, that is far afield from being used in marketing, quite far afield from an "obvious marketing shift", and worlds away from such an obvious marketing shift that it's worth calling your readers dense if they don't "see" it.

    It's also not even wrong, in the Pauli sense, in that: what, exactly, would be the marketing benefit of "shifting from AGI to ASI"? Both imply human replacement.

    > As much as many people hate on "gig" economy

    Is this relevant?

    > most of these people would be worse off without driving Uber or delivering with DoorDash

    Do people who hate on the gig economy think gig economy employees would be better off without gig economy jobs?

    Given the well-worn tracks of history, do we think that these things are zero sum, where if you preserve jobs that could be automated, that keeps people better off, because otherwise they would never have a job?

    > ...lots more delivery service stuff...

    ?

    > Current hype is also so terrible. AGENTS. AGENTS EVERYWHERE. Except they don't work most of the time and by the time you realize it isn't working you've already spent $20. 100k people do the same thing, company reports 2M x 12 = 24 million ARR UNLOCKED!!!!!! And raises another round of funding...

    I hate buzzwords too, I'm stunned how many people took their not-working thing and relaunched it as an "agent" that still doesn't work.

    But this is a hell of a strawman.

    If the idea is 100K people try it, and cancel after one month, which means they're getting 100K new suckers every month to replace the old ones... I'd tell you that it's safe to assume there's more that goes into getting an investor check than "what's your ARR claim?" --- here, they'd certainly see the churn.

    • Loved your reply, cheers! My post was made with a mix of humor, skepticism, anticipation, and unease about the $statusQuo.

      As far as hating on gig economy, that pot has been stirring in California quite a bit (prop 22, labor law discussions, etc.). I think many people (IMO, mostly from positions of privilege) make assumptions on gig workers' behalf and bad ideas sometimes balloon out of proportion.

      Also, just from my experience as a gold miner who moved out here to SF and being around founders, I've learned that lies, and a damn lot of lies, are more common than I thought they'd be. Quite surprising, but hey, I guess a not-insignificant number of people are busy fooling the King into thinking it's actually real gold! And there are a lot of Kings these days.

      edit: ESL lol

I have become a little more skeptical of LLM "reasoning" after DeepSeek (and now Grok) let us see the raw outputs. Obviously we can't deny the benchmark numbers - it does get the answer right more often given thinking time, and it does let models solve really hard benchmarks. Sometimes the thoughts are scattered and inefficient, but do eventually hit on the solution. Other times, it seems like they fall into the kind of trap LeCun described.

Here are some examples from playing with Grok 3. My test query was, "What is the name of a Magic: The Gathering card that has all five vowels in it, each occurring exactly once, and the vowels appear in alphabetic order?" The motivation here is that this seems like a hard question to just one-shot, but given sufficient ability to continue recalling different card names, it's very easy to do guess-and-check. (For those interested, valid answers include "Scavenging Ghoul", "Angelic Chorus" and others)
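
The check itself is trivial to write down (a minimal sketch of the verification half of guess-and-check; the hard part for the model is recalling candidate card names, not verifying them):

    def vowels_once_in_order(name: str) -> bool:
        # Pull out the vowels in the order they appear in the card name.
        vowels = [c for c in name.lower() if c in "aeiou"]
        # All five must appear, each exactly once, in alphabetical order.
        return vowels == ["a", "e", "i", "o", "u"]

    print(vowels_once_in_order("Scavenging Ghoul"))       # True
    print(vowels_once_in_order("Abian, Luvion Usurper"))  # False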

In one attempt, Grok 3 spends 10 minutes (!!) repeatedly checking whether "Abian, Luvion Usurper" satisfies the criteria. It'll list out the vowels, conclude it doesn't match, and then go, "Wait, but let's think differently. Maybe the card is "Abian, Luvion Usurper," but no", and just produce variants of that thinking. Counting occurrences of the word "Abian" suggests it tested this theory 800 times before eventually timing out (or otherwise breaking), presumably just because the site got overloaded.

In a second attempt, it decides to check "Our Market Research Shows That Players Like Really Long Card Names So We Made this Card to Have the Absolute Longest Card Name Ever Elemental" (this a real card from a joke set). It attempts to write out the vowels:

>but let's check its vowels: O, U, A, E, E, A, E, A, E, I, E, A, E, O, A, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E ...

It continues like this for about 600 more vowels, before emitting a random Russian(?) word and breaking out:

>...E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O продуктив

These two examples seem like the sort of failures LeCun conjectured. The model gets into a cycle of self-reinforced unproductive behavior. Every time it checks Abian, or emits another "AEEEAO", it becomes even more probable that the next tokens should be the same.

  • I did some testing with the new Gemini model on some OCR tasks recently. One of the failures was it just getting stuck and repeating the same character sequence ad-infinitum until timing out. It's a great failure mode when you charge by the token :D

    • I've seen similar things with Claude and OCR at low temperature. A higher temperature, 0.8, resolved it for me. But I was using low temp for reproducibility, so...

  • I think this is valid criticism, but it's also unclear how much this is an "inherent" shortcoming vs the kind of thing that's pretty reasonable given we're really seeing the first generation of this new model paradigm.

    Like, I'm as sceptical of just assuming "line goes up" extrapolation of performance as anyone, but assuming that current flaws are going to continue being flaws seems equally wrong-headed/overconfident. The past 5 years or so have been a constant trail of these predictions being wrong (remember when people thought artists would be safe cos clearly AI just can't do hands?). Now that everyone's woken up to this RL approach, we're probably going to see very quickly over the next couple of years how much these issues hold up.

    (Really like the problem though, seems like a great test)

    • Yeah, that's a great point. While this is evidence that the sort of behavior LeCun predicted is currently displayed by some reasoning models, it would be going too far to say that it's evidence it will always be displayed. In fact, one could even have a more optimistic take - if models that do this can get 90+% on AIME and so on, imagine what a model that had ironed out these kinks could do with the same amount of thinking tokens. I feel like we'll just have to wait and see whether that pans out.

  • I don't know whether treating a model as a database is really a good measure.

    • Yeah, I'm not so much interested in "can you think of the right card name from among thousands?". I just want to see that it can produce a thinking procedure that makes sense. If it ends up not being able to recall the right name despite following a good process of guess-and-check, I'd still consider that a satisfactory result.

      And to the models' credit, they do start off with a valid guess-and-check process. They list cards, write out the vowels, and see whether it fits the criteria. But eventually they tend to go off the rails in a way that is worrying.

  • What did I miss about DeepSeek?

    • Just that it's another model where you can read the raw "thinking" tokens, and they sometimes fall into this sort of rut (as opposed to OpenAI's models, for which summarized thinking may be hiding some of this behavior).

> And years later, we’re still not quite at FSD. Teslas certainly can’t drive themselves; Waymos mostly can, within a pre-mapped area, but still have issues and intermittently require human intervention.

This is a bit unfair to Waymo as it is near-fully commercial in cities like Los Angeles. There is no human driver in your hailed ride.

> But this has turned out to be wrong. A few new AI systems (notably OpenAI o1/o3 line and Deepseek R1) contradict this theory. They are autoregressive language models, but actually get better by generating longer outputs:

The arrow of causality is flipped here. Longer outputs do not make a model better; a better model can produce longer outputs without being derailed. The referenced graph from DeepSeek doesn't prove anything the author claims. Considering that this argument is one of the key points of the article, this logical error is a serious one.

> He presents this problem of compounding errors as a critical flaw in language models themselves, something that can’t be overcome without switching away from the current autoregressive paradigm.

LeCun is a bit reductive here (understandably, as it was a talk for a live audience). Indeed, autoregressive algorithms can go astray as previous errors do not get corrected or, worse yet, accumulate. However, an LLM is not autoregressive in the customary sense: it is not like a streaming algorithm (O(n)) used in time series forecasting. LLMs have attention mechanisms and large context windows, making the algorithm at least quadratic, depending on the implementation. In other words, an LLM can backtrack if the current path is off and start afresh from a previous point of its choice, not just from the last output. So, yes, the author is making a valid point here, but technical details were missing. On a minor note, the non-error probability in LeCun's slide actually reflects a non-autoregressive assumption. He seems to be contradicting himself in the very same slide.

I actually agree with the author on the overarching thesis. There is almost a fetishization of AGI and humanoid robots. There are plenty of interesting applications well before those things are accomplished. The correct focus, IMO, should be measurable economic benefits, not sci-fi terms (although I concede these grandiose visions can be beneficial for fundraising!).

  • It's not true that Waymo is fully autonomous. It's been revealed that they maintain human "fleet response" agents to intervene in their operations. They have not revealed how often these human agents intervene, possibly because it would undermine their branding as fully autonomous.

    • It is obvious to the user when this happens; the car pauses, and the screen shows a message saying it is asking for help. I've seen it happen twice across dozens of rides, and one of those times was because I broke the rules and touched the controls (turned on the window wipers when it was raining).

      They also report disengagements in California periodically; here's data: https://www.dmv.ca.gov/portal/vehicle-industry-services/auto...

      and an article about it: https://thelastdriverlicenseholder.com/2025/02/03/2024-disen...

    • I am not sure what you are arguing against. Neither the author nor I stated or implied that Waymo is fully autonomous. It wasn't even the main point I made.

      My point stands: Waymo has been technically successful and commercially viable at least thus far (though long-term amortized profitability remains to be seen). To characterize it as hype or vaporware from the AGI crowd is a tad unfair to Waymo. Your point about high-latency "fleet response" by Waymo only proves my point: it is now technically feasible to remove the immediate-response driver and have the car managed by high-latency remote guidance only occasionally.

    • Yeah, this is exactly my point. The miles-driven-per-intervention (or whatever you want to call it) has gone way up, but interventions still happen all the time. I don't think anyone expects the number of interventions to drop to zero any time soon, and this certainly doesn't seem to be a barrier to Waymo's expansion.

  • I don't think whether LLMs use only the last token, or all past tokens, affects LeCun's argument. LLMs already used large context windows when LeCun made this argument. On the other hand, allowing backtracking does affect it, and that is not something the standard LLM did back when LeCun made his argument.

>> But the limiting behavior remains the same: eventually, if we continue generating from a language model, the probability that we get the answer we want still goes to zero

In the previous paragraph, the author makes the case for why LeCun was wrong, using the example of reasoning models. Yet, in the next paragraph, this assertion is made, which is just a paraphrasing of LeCun's original assertion, which the author himself says is wrong.

>> Instead of waiting for FAA (fully-autonomous agents) we should understand that this is a continuum, and we’re consistently increasing the amount of useful work AIs can do without human intervention

Yes! But this work is already well underway. There is no magic threshold for AGI - instead the characterization is based on what percentile of the human population the AI can beat. One way to characterize AGI in this manner is "99.99th percentile at every (digital?) activity".

  • > In the previous paragraph, the author makes the case for why LeCun was wrong, using the example of reasoning models. Yet, in the next paragraph, this assertion is made, which is just a paraphrasing of LeCun's original assertion, which the author himself says is wrong.

    This is a subtle point that may have not come across clearly enough in my original writing. A lot of folks were saying that the DeepSeek finding that longer chains of thought can produce higher-quality outputs contradicts Yann's thesis overall. But I don't think so.

    It's true that models like R1 can correct small mistakes. But in the limit of tokens generated, the chance that they generate the correct answer still decays to zero.
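
    To spell the limit argument out (this is just the per-token error model from Yann's slide, restated loosely, with some fixed per-token error rate ε):

         P(fully correct n-token output) ≈ (1 - ε)^n  →  0  as  n → ∞

    Self-correction can push the effective ε way down, but as long as it stays above zero, the sequence-level success probability still decays with length.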

    • I think this is an excellent way to think about LLMs and any other software-augmented task. Appreciate you putting the time into an article. I do think your points supported by the graph of training steps vs. response length could be improved by including a graph of (response length vs. loss) or (response length vs. task performance), etc. Though # of steps correlates with model performance, this relationship weakens as # of steps goes to infinity.

      There was a paper not too long ago which illuminated that reasoning models will increase their response length more or less indefinitely toward solving a problem, but the return from doing so asymptotes toward zero. My apologies for not having the link.

    • Thanks for replying, hope it wasn't too critical.

      >> But in the limit of tokens generated, the chance that they generate the correct answer still decays to zero.

      I don't understand this assertion though.

      LeCun's thesis was that errors just accumulate.

      Reasoning models accumulate errors, backtrack, and are able to reduce them back down.

      Hence the hypothesis of errors accumulating (at least asymptotically) is false.

      What is the difference between "Probability of correct answer decaying to zero" and "Errors keep accumulating" ?

A human being is generally intelligent and within a given role has the same "management asymptote", a limit of job capability beyond which the organization surrounding them can no longer make use of it. This isn't a flaw in the intelligence, it is a restraint imposed by expecting it or them to act without agency or the opportunity to choose between benevolence and self-benefit.

> Instead of waiting for FAA (fully-autonomous agents) we should understand that this is a continuum, and we’re consistently increasing the amount of useful work AIs can do without human intervention. Even if we never push this number to infinity, each increase represents a meaningful improvement in the amount of economic value that language models provide. It might not be AGI, but I’m happy with that.

That's all good, but the question remains: to whom will that economic value be delivered, when the primary mechanism we have for distributing economic value - human employment - will be in shorter supply once the "good enough" AIs multiply the productivity of the humans who still have jobs?

If there is no plan for that, we have bigger problems ahead.

  • This is a really important question that still has no answer. No one wins in the late stage of capitalism.

I wonder if what happens when we dream is similar to AIs. We start with some model of reality, generate a scenario, and extrapolate on it. It pretty much always goes "off the rails" at some point, dreams don't stay realistic for long.

When we're awake we have continual inputs from the outside world, these inputs help us keep our mental model of the world accurate to the world, since we're constantly observing the world.

Could it be that LLMs are essentially just dreaming? Could we add real-world inputs continually to allow them to "wake up"? I suspect more is needed; the separate training & inference phases of LLMs are quite unlike how humans work.

  • This is the thing that stands out to me. Nearly all of the criticisms levelled at LLMs are problems I, myself, would make if you locked me in a sensory isolation tank and told me I was being paid a million bucks an hour to think really hard. Humans already have terms for this - overthinking, rumination, mania, paranoia, dreaming.

    Similarly, a lot of cognitive tasks become much more difficult without the ability to recombine with sensory data. Blindfold chess. Mental mathematics.

    Whatever it is that sleep does to us, agents are not yet capable of it.

Thank you for this informative and thoughtful post. An interesting twist on the increasing error accumulation as autoregressive models generate more output is the recent success of language diffusion models that predict multiple tokens simultaneously. They have a remasking strategy at every step of the revision process that masks low-confidence tokens. Regardless, your observations perhaps still apply. https://arxiv.org/pdf/2502.09992

  • Thanks for bringing this up! As far as I understand it current text diffusion models are limited to fairly short context windows. The idea of a text diffusion model continuously updating and revising a million-token-long chain-of-thought is pretty mind-boggling. I agree that these non-autoregressive models could potentially behave in completely different ways.

    That said, I'm pretty sure we're a long way from building equally-competent diffusion-based base models, let alone reasoning models.

    If anyone's interested in this topic, here are some more foundational papers to take a look at:

    - Simple and Effective Masked Diffusion Language Models [2024] (https://arxiv.org/abs/2406.07524)

    - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [2023] (https://arxiv.org/abs/2310.16834)

    - Diffusion-LM Improves Controllable Text Generation [2022] (https://arxiv.org/abs/2205.14217)

Accelerando[1] best captured what will happen. Looking back, we'll be able to identify the seeds of what becomes AGI, but we cannot know in the present what that is. Only by looking back with the benefit of hindsight can we draw a line through the progression of capability. Consequently, discussion about whether or not a particular set of present or future skills constitutes AGI is a completely pointless endeavor and is tantamount to intellectual masturbation.

1. 2005 science fiction novel by Charles Stross

> Yann Lecun ... argued that because language models generate outputs token-by-token, and each token introduces a new probability of error, if we generate outputs that are too long, this per-token error will compound to inevitable failure.

That seems like a poor argument. Each word a human utters also has a chance of being wrong, yet somehow we have been successful overall.

> per-token error will compound to inevitable failure.

Is this why all tracks made with Udio and Suno have this weird noise creep in the longer the song goes on? You can test it by comparing the start and the end of a song - even if it's the exact same beat and instruments, you can hear a difference in the amount of noise (and the noise profile imo is unique to AI models).

  • This is an interesting example, I'd never heard of it before. I don't really use Udio or Suno yet. The weird noise you mention probably stems from the same issue, known in the research world as exposure bias: we train these models on real data but use them on their own outputs, so after generating for a while the models' outputs start to diverge from what real data looks like.

> We should be thinking about language models the same way we think about cars: How long can a language model operate without needing human intervention to correct errors?

I agree with this premise. The second dimension is how much effort you have to put into each intervention. The input effort needed at each intervention can vary widely, and that has to be accounted for.

> The finding that language models can get better by generating longer outputs directly contradicts Yann’s hypothesis. I think the flaw in his logic comes from the idea that errors must compound per-token. Somehow, even if the model makes a mistake, it is able to correct itself and decrease the sequence-level error rate

I don’t think current LLM behavior is necessarily due to self-correction; it is more due to the availability of internet-scale data. But I know that reasoning models are building towards self-correction. The problem, I think, is that even reasoning models are rote because they lack information synthesis, which in biological organisms comes from the interplay between short-term and long-term memories. I am looking forward to LLMs which surpass rote and mechanical answering and reasoning capabilities.

  • I absolutely agree with information synthesis being a big missing piece in the quest to AGI. It's probably something that could eventually be conquered one way or another or just discovered by accident. However, we need to stop and think of the implications of this technology becoming a thing.

The degree to which AI can generalize previously trained networks to novel tasks is growing with time.

The degree to which I care about that has not.

It just means we can get better inference with less targeted models. Whoopdy doo

Not a fan of these kinds of arguments. The "correct" token is entirely dependent on the dataset. An LLM could have perfect training loss given a dataset, but this has no predictive power on its ability to "answer" arbitrary prompts.

In natural language, many strings are equally valid. There are many ways to chain tokens together to get the "correct" answer to an in-sample prompt. For ambiguous sequences of tokens, a model with perfect loss will produce a likelihood over the next tokens that corresponds to the number of valid token paths in the corpus through each candidate token.

Compounding errors can certainly happen, but for many things upstream of the key tokens it's irrelevant. There are so many ways to phrase things that are equally correct - I mean, this is how language evolved (and continues to). Getting back to my first point: if you assume you have an LLM with perfect loss on the training dataset, you can still get garbage back at test time - thus I'm not sure thinking about "compounding errors" is useful.

Errors in LLM reasoning, I suspect, are more closely related to noisy training data or an overabundance of low-quality training data. I've observed this in how all the reasoning LLMs work: given things that are less common in the corpus (the internet and digital assets) and that require higher-order reasoning, they tend to fail. Whereas advanced math or programming problems tend to go a bit better; the input data is likely much cleaner.

But for something like: how do I change the fixture on this light, I'll get back some kind of garbage from the SEO-verse. IMO next step for LLMs is figuring out how to curate an extremely high quality dataset at scale.

LeCun's thesis: "if we generate outputs that are too long, the per-token error will compound to inevitable failure".

> The finding that language models can get better by generating longer outputs directly contradicts Yann’s hypothesis.

The author's examples show that the error has been minimized for a few examples of a certain length. This doesn't contradict LeCun, afaict.

I think Yann is right if all you do is output a token that depends only on the previous token. If it's a simple Markov chain, sure, errors will eventually compound. But with the attention mechanism, the output token depends not only on the previous one, but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its errors (and hence the "aha moment"s).

  • No this isn't right. The probabilistic formulation for autoregressive language models looks like this

         p(x_n | x_1 ... x_{n-1})
    

    which means that each token depends on all the previous tokens. Attention is one way to parameterize this. Yann's not talking about Markov chains, he's talking about all autoregressive models.
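
    Written out in full, it's just the chain-rule factorization (nothing architecture-specific here):

         p(x_1, ..., x_N) = p(x_1) p(x_2 | x_1) ... p(x_N | x_1 ... x_{N-1})

    Any model that generates left-to-right by sampling from these conditionals is autoregressive in this sense, whether the conditional is parameterized by a Markov table, an RNN, or attention over the full context.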

  • > But with the attention mechanism

    I would think LeCun was aware of that. Also, prior sequence-to-sequence models like RNNs already incorporated information about the more distant past.

I don't think people really want AGI. They want something else.

When we've achieved AGI, it should have the capability to make its own determination. I'm not saying we should build it with that capability. I'm saying that capability is necessary to have what I would consider to be AGI. But that would mean it is, by definition, outside of our control. If it doesn't want to do the thing, it will not do the thing.

People seem to want an expert slave. Someone with all of the technical chops to achieve a thing, but will do exactly what they're told.

And I don't think we'll ever get there.

According to LeCun's model, a human walking step by step would have the error compounding with each step and thus would never make it to whatever intended target. Yet, as toddlers, we somehow manage to learn to walk to our targets. (And I'm an MS in Math, Control Systems :)

  • A more apt analogy would be a human trying to walk somewhere with their eyes closed, i.e., what you may know as open-loop control.

  • A toddler can learn by trial and error mid-process. An LLM using autoregressive inference can only compound errors. The LLDM model paper was posted elsewhere, but: https://arxiv.org/pdf/2502.09992

    It basically uses the image generation approach of progressively refining the entire thing at once, but applied to text. It can self-correct mid-process.

    The blog post where I found it originally that goes into more detail and raises some issues with it: https://timkellogg.me/blog/2025/02/17/diffusion

    • Autoregressive vs non-autoregressive is a red herring. The non-autoregressive model is still susceptible to exponential blow-up of the failure rate as the output dimension increases (sequence length, number of pixels, etc). The final generation step in, e.g., diffusion models is independent Gaussian sampling per pixel. These models can be interpreted, like autoregressive models, as assigning log-likelihoods to the data. The average log-likelihood per token/pixel/etc can still be computed, and the same "raise the per-unit error to the number-of-units power" argument for exponential failure rates still holds.

      One potential difference between autoregressive and non-autoregressive models is the types of failures which occur. E.g., typical failures in autoregressive models might look like spiralling off into nonsense once the first "error" is made, while non-autoregressive models might produce failures that tend to remain relatively "close" to the true data.

    • >A toddler can learn by trial and error mid-process.

      As a result of the whole learning process, the toddler learns how to self-correct, i.e. as a grown-up s/he knows, without much trial and error anymore, how to continue in a straight line if the previous step went sideways for whatever reason.

      >An LLM using autoregressive inference can only compound errors.

      That is a pretty powerful statement, completely dismissing that some self-correction may be emerging there.

We've replaced the baity generic title with the slightly less baity, but at least specific, subtitle.

(This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.)

  • Thanks for the attention. Why did you change the title? I don't really care much; just curious which specific guideline you're referring to.

  • Is it that imperative voice is necessarily baity? Or is it simply the case that we can't be outwardly critical of AI itself at all anymore? I know you're working hard, dang, but things are getting a little fuzzy around here lately...

    https://hn.algolia.com/?q=please+dont

    https://hn.algolia.com/?q=please+stop

    • Oh for sure it is. "Please stop $Fooing" and its more aggressive cousin, "Stop $Fooing" (not to mention "For the love of god would you all please stop $Fooing or I will $Bar you" and sundry variations) belong to a family of internet linkbait tropes.

      > is it simply the case that we can't be outwardly critical of AI itself at all anymore

      You need only, er, delve into any large HN thread about AI to see that this is very far from the case! Especially the more generic threads about opinion pieces and so on.

      I think the air on HN is too cynical and curmudgeonly towards new tech right now, and that worries me. Not that healthy skepticism is unwarranted (it's fine of course) but for HN itself to be healthy, there ought to be more of a balance. Cranky comments about "slop"* ought not to be the main staple here—what we want is curious conversation about interesting things—but right now it's not only the main staple, I feel like we're eating it for breakfast, lunch, and dinner.

      But I'm not immune from the bias I'm forever pointing out to other people (https://news.ycombinator.com/item?id=43134194), and that's probably why we have opposite perceptions of this!

      * (yes it annoys me too, that's not my point here though)