Comment by Imnimo

1 day ago

I have become a little more skeptical of LLM "reasoning" after DeepSeek (and now Grok) let us see the raw outputs. Obviously we can't deny the benchmark numbers - it does get the answer right more often given thinking time, and it does let models solve really hard benchmarks. Sometimes the thoughts are scattered and inefficient, but do eventually hit on the solution. Other times, it seems like they fall into the kind of trap LeCun described.

Here are some examples from playing with Grok 3. My test query was, "What is the name of a Magic: The Gathering card that has all five vowels in it, each occurring exactly once, and the vowels appear in alphabetic order?" The motivation here is that this seems like a hard question to just one-shot, but given sufficient ability to continue recalling different card names, it's very easy to do guess-and-check. (For those interested, valid answers include "Scavenging Ghoul", "Angelic Chorus" and others)
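
For reference, here's a minimal sketch of the check the model needs to run on each candidate name (assuming we only count a/e/i/o/u and ignore "y"):

```python
# Guess-and-check predicate: a name qualifies if its vowels, read left
# to right, are exactly a, e, i, o, u - each once, in alphabetical order.
def vowels_in_order(name: str) -> bool:
    vowels = [c for c in name.lower() if c in "aeiou"]
    return vowels == list("aeiou")

print(vowels_in_order("Scavenging Ghoul"))       # True
print(vowels_in_order("Angelic Chorus"))         # True
print(vowels_in_order("Abian, Luvion Usurper"))  # False
```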

In one attempt, Grok 3 spends 10 minutes (!!) repeatedly checking whether "Abian, Luvion Usurper" satisfies the criteria. It'll list out the vowels, conclude it doesn't match, and then go, "Wait, but let's think differently. Maybe the card is 'Abian, Luvion Usurper,' but no", and just produce variants of that thinking. Counting occurrences of the word "Abian" suggests it tested this theory 800 times before eventually timing out (or otherwise breaking), presumably just because the site got overloaded.

In a second attempt, it decides to check "Our Market Research Shows That Players Like Really Long Card Names So We Made this Card to Have the Absolute Longest Card Name Ever Elemental" (this is a real card from a joke set). It attempts to write out the vowels:

>but let's check its vowels: O, U, A, E, E, A, E, A, E, I, E, A, E, O, A, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E ...

It continues like this for about 600 more vowels, before emitting a random Russian(?) word and breaking out:

>...E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O продуктив

These two examples seem like the sort of failures LeCun conjectured. The model gets into a cycle of self-reinforcing, unproductive behavior. Every time it checks Abian, or emits another "AEEEAO", it becomes even more probable that the next tokens should be the same.

I did some testing with the new Gemini model on some OCR tasks recently. One of the failures was it getting stuck and repeating the same character sequence ad infinitum until timing out. It's a great failure mode when you charge by the token :D

  • I've seen similar things with Claude and OCR at low temperature. A higher temperature, 0.8, resolved it for me. But I was using low temp for reproducibility, so... (a rough sketch of the call is below)
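
    A minimal sketch of what that looks like with the Anthropic Python SDK (the model id and prompt are placeholders, not what I actually ran):

    ```python
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=1024,
        temperature=0.8,  # bumping this up is what broke the repetition loop for me
        # For real OCR you'd attach the scan as an image content block;
        # a plain text prompt stands in for it here.
        messages=[{"role": "user", "content": "Transcribe the text in the attached scan."}],
    )
    print(response.content[0].text)
    ```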

I think this is valid criticism, but it's also unclear how much this is an "inherent" shortcoming vs the kind of thing that's pretty reasonable given we're really seeing the first generation of this new model paradigm.

Like, I'm as sceptical of "line goes up" extrapolation of performance as anyone, but assuming that current flaws will continue being flaws seems equally wrong-headed/overconfident. The past 5 years or so have been a constant trail of these predictions being wrong (remember when people thought artists would be safe cos clearly AI just can't do hands?). Now that everyone's woken up to this RL approach, we're probably going to see very quickly over the next couple of years how much these issues hold up.

(Really like the problem though, seems like a great test)

  • Yeah, that's a great point. While this is evidence that the sort of behavior LeCun predicted is currently displayed by some reasoning models, it would be going too far to say that it's evidence it will always be displayed. In fact, one could even have a more optimistic take - if models that do this can get 90+% on AIME and so on, imagine what a model that had ironed out these kinks could do with the same amount of thinking tokens. I feel like we'll just have to wait and see whether that pans out.

I don't know whether treating a model as a database is really a good measure.

  • Yeah, I'm not so much interested in "can you think of the right card name from among thousands?". I just want to see that it can produce a thinking procedure that makes sense. If it ends up not being able to recall the right name despite following a good process of guess-and-check, I'd still consider that a satisfactory result.

    And to the models' credit, they do start off with a valid guess-and-check process. They list cards, write out the vowels, and see whether it fits the criteria. But eventually they tend to go off the rails in a way that is worrying.

What did I miss about DeepSeek?

  • Just that it's another model where you can read the raw "thinking" tokens, and they sometimes fall into this sort of rut (as opposed to OpenAI's models, for which summarized thinking may be hiding some of this behavior).