Comment by mirekrusin
9 hours ago
Hold on, their evaluation tasks are based on rotating letters in text? Isn't this a known weak area for token-based models?
I think that's the point, really: It's a reliable and reproducible weakness, but also one where the model can be trained to elicit impressive-looking "reasoning" about what the problem is and how it "plans" to overcome it.
Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.
Kind of like a Chinese Room scenario: If the other end appears to talk about algebra perfectly well, but just can't do it, that's evidence you might be talking to a language-lookup machine instead of one that can reason.
Reminds me of a number of grad students I knew who could “talk circles” around all sorts of subjects but failed to ever be able to apply anything.
Heh, but just because a human can fail at something doesn't mean everything that fails at it is human. :p
> Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.
That doesn't follow if the model's weakness manifests on a different level, one we wouldn't call rational in a human.
For example, a human might have dyslexia, a disorder on the perceptual level. A dyslexic can understand and explain his own limitation, but that doesn't help him overcome it.
Typically, when a human has a disorder or limitation, they adapt to it by developing coping strategies or by using tools and environmental changes to compensate. Maybe they expect a true reasoning model to be able to do the same thing?