Comment by moi2388
1 day ago
“ the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data.”
Why? If it’s out of domain we know it’ll fail.
> Why? If it’s out of domain we know it’ll fail.
To see whether LLMs actually adhere to logic, or whether the observed "logical" responses are rather reproductions of patterns.
I personally enjoy this idea of isolating "logic" from "pattern" and seeing whether "logic" will manifest in LLM "thinking" within a "non-patternized" domain.
--
Also, it's never a bad thing to give the public proof that "thinking" (like "intelligence") in the AI context isn't the same thing we understand intuitively.
--
> If it’s out of domain we know it’ll fail.
Below is a question which is out of domain, yet LLMs handle it in what appears to be a logical way.
``` Kookers are blight. And shmakers are sin. If peker is blight and sin who is he? ```
It is out of domain and it does not fail (I've put it through Gemini 2.5 with thinking). Now back to the article: is the observed logic intrinsic to LLMs, or is it an elaborate form of pattern? According to the article, it's a pattern.
Out of domain means that the type of logic hasn’t been in the training set.
“All A are B, all C are D, X is B and D, what is X?” is not outside this domain.
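For illustration, here's a rough sketch (Python, with the actual model call left as a hypothetical `ask_llm` helper, not any real API) of the kind of probe that would test the form rather than the tokens: invent nonsense words for every term, and vary the logical shape itself between a familiar one and a composed one.

```python
import random

def nonsense_word(length=6):
    """Random pronounceable-ish token unlikely to appear verbatim in training data."""
    vowels, consonants = "aeiou", "bcdfgklmnprstvz"
    return "".join(
        random.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )

def familiar_form():
    """Familiar shape: 'All A are B. X is an A. Is X B?' (plain one-step inference)."""
    a, b, x = (nonsense_word() for _ in range(3))
    return f"All {a}s are {b}. {x} is a {a}. Is {x} {b}?"

def composed_form():
    """Less familiar shape: two rules combined in reverse, like the kooker/shmaker prompt."""
    a, b, c, d, x = (nonsense_word() for _ in range(5))
    return (
        f"All {a}s are {b}. All {c}s are {d}. "
        f"{x} is both {b} and {d}. What is {x} most likely to be?"
    )

if __name__ == "__main__":
    random.seed(0)
    for make_prompt in (familiar_form, composed_form):
        prompt = make_prompt()
        print(prompt)
        # reply = ask_llm(prompt)  # hypothetical helper; wire up whatever LLM API you use
```

The nonsense tokens only rule out memorized facts; whether the *logical form* is in-domain is a separate question, which is the point of the reply above.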
I don't think we know that it'll fail, or at least that is not universally accepted as true. Rather, there are claims that given a large enough model / context window, such capabilities emerge. I think skepticism of that claim is warranted. This research validates that skepticism, at least for certain parameters (model family/size, context size, etc.).
There's a question which was rhetorically asked by Yaser S. Abu-Mostafa: "How do we know if we're learning from data?" and his answer was: "We are learning from data if we can generalize from our training set to our problem set."
To me, it feels a lot like Deming's "what gets measured gets done" (with the quiet part "...oftentimes at the expense of everything else."). Of course, the quiet part is different in this case.
What is this "domain" of which you speak? Because LLMs are supposedly good for flying airplanes, mental health, snakebites, and mushroom poisoning.
It's getting to the nub of whether models can extrapolate rather than merely interpolate.
If they had _succeeded_, we'd all be taking it as proof that LLMs can reason, right?