Comment by foobarqux
6 months ago
The real problem with the paper is not any of the mathematical details that others have described; it is more fundamental. Chomsky's claim is that humans have a distinctive property: they seem unable to process certain synthetic language constructions, namely linear (non-hierarchical) languages, as easily as synthetic human-like (hierarchical) languages, and they use a different part of the brain to do so. This was shown in experiments (see Moro, The Secrets of Words; I think his Nature paper also cites the studies).
Because the synthetic linear languages are computationally and structurally simple, LLMs will, unlike humans, learn them just as easily as real human languages. Since this hierarchical aspect of human language seems fundamental and important, LLMs are therefore not a good model of the human language faculty.
If you want to refute that claim, you would take synthetic language constructions similar to those used in the experiments and show that LLMs take longer to learn them.
Instead you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do except in an abstract sense).
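To make that parenthetical concrete, here is a toy sketch of my own (nothing to do with the paper's actual languages): a "language" in which every word is transmitted as the product of a small prime and a large prime key, so that reading a sentence requires factoring each token. Obviously any learner, human or LLM, will fail on this once the factors are large, which is exactly why the bare existence of hard-to-learn languages is uninteresting.

```python
# Toy example of my own, not from the paper: a "language" you must factor to read.
# The i-th vocabulary item is sent as SMALL_PRIMES[i] * KEY, so recovering a word
# means finding a prime factor of each token. Here the vocabulary primes are tiny,
# so decoding is easy; make both factors large and decoding becomes a genuine
# (computationally expensive) factoring problem for any learner, human or machine.

SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # toy vocabulary of 8 "words"
KEY = 999_983                                # a large prime acting as the key

def encode(word_id):
    return SMALL_PRIMES[word_id] * KEY

def decode(token):
    # Recover the word id by trial division over the small vocabulary primes.
    for i, p in enumerate(SMALL_PRIMES):
        if token % p == 0:
            return i
    raise ValueError("not a well-formed token")

sentence = [encode(i) for i in [0, 4, 2, 0, 7]]
print([decode(t) for t in sentence])  # [0, 4, 2, 0, 7]
```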
The one language you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.
> Instead you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do except in an abstract sense).
Thanks for your feedback. I think our manipulations do establish that there are nontrivial inductive biases in Transformer language models and that these inductive biases are aligned with human language in important ways. There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't. It seems that GPT language models do favor real language over the perturbed ones, and this shows that they have a simplicity bias which aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.
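To be concrete about what I mean by a deterministic shuffle, here is a rough sketch of the kind of manipulation involved (illustrative only, not our exact implementation): each sentence's tokens are permuted by a fixed pseudo-random permutation that depends only on sentence length, so the perturbed language is fully deterministic and recoverable, yet it destroys the original hierarchical structure.

```python
import random

def deterministic_shuffle(tokens, seed_base=0):
    """Illustrative sketch, not the paper's actual code: permute a sentence's
    tokens with a pseudo-random permutation determined solely by sentence
    length, so every sentence of the same length is scrambled the same way.
    The mapping is deterministic and invertible, but it destroys the
    hierarchical constituent structure of the original sentence."""
    order = list(range(len(tokens)))
    random.Random(seed_base + len(tokens)).shuffle(order)
    return [tokens[i] for i in order]

print(deterministic_shuffle("the dog that barked chased the cat".split()))
```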
Furthermore, this alignment is interesting even if it isn't perfect. I would be shocked if GPT language models happened to have inductive biases that perfectly match the structure of human language; why would they? But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do. As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits. Something similar is happening now with neural language models as models of language learning and processing, a very active research field. So I wouldn't say that neural language models can't shed any light on language simply because they're not a perfect match for a particular aspect of language.
As for using languages more directly based on the Moro experiments, we've discussed this extensively. There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set, where the control condition is a real language instead of a toy language, without introducing confounds of various kinds. We're open to suggestions. We've had very productive conversations with syntacticians about how to formulate new baselines in future work.
More generally our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully, to the point that such languages can be used to test the inductive biases of neural models. It's not as simple as hierarchical vs. linear, since there are purely linear phenomena in syntax such as Closest Conjunct Agreement, and morphophonological processes can also act linearly across constituent boundaries, among other complications.
> The one language you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.
I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences.
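As a back-of-the-envelope illustration with made-up numbers (not figures from the paper): if only a small fraction of sentences are actually altered by the Hop transformation, even a sizable degradation on that subset is diluted once you average over the whole test set.

```python
# Hypothetical numbers, purely to illustrate the dilution effect of averaging.
frac_affected = 0.1      # fraction of sentences actually changed by Hop
ppl_unaffected = 20.0    # mean per-sentence perplexity on untouched sentences
ppl_affected = 30.0      # mean per-sentence perplexity on transformed sentences (+50%)

corpus_avg = (1 - frac_affected) * ppl_unaffected + frac_affected * ppl_affected
print(corpus_avg)  # 21.0 -> only a 5% increase in the corpus-level average
```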
> these inductive biases are aligned with human language in important ways.
They aren’t, which is the entire point of this conversation, and simply asserting otherwise isn’t an argument.
> It seems that GPT language models do favor real language over the perturbed ones, and this shows that they have a simplicity bias which aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.
This is a nonsensical argument: consider if you had studied a made-up language that required you to factor numbers or do something else inherently computationally expensive. LLMs would show a simplicity bias "just like humans", but it's obvious that this doesn't tell you anything, and specifically doesn't tell you that LLMs are like humans in any useful sense.
> There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't.
You are missing the point, which is that humans cannot learn the Moro languages as easily as human-like ones, while LLMs can. Therefore LLMs are different from humans in a fundamental way. This difference is so fundamental that you need to give strong, specific, explicit justification for why LLMs are useful in explaining humans. The only reason I used the word "simple" is to argue that LLMs would be able to learn such languages easily (without even having to run an experiment), but the same would be true if LLMs learned a non-simple language that humans couldn't.
Again, it doesn't matter if you find all the ways in which humans and LLMs are the same (for example, that they both struggle with shuffled sentences or with a language that involves factoring numbers); what matters is that there exists a fundamental difference between them, exemplified by the Moro languages.
> But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do.
Why? There is no reason to believe you will learn anything from it. This is a bizarre, abstract argument that doing something is useful because you might learn something from it; you can say that about anything you do. There is a video on YouTube where Chomsky engages with someone making similar arguments about chess computers. Chomsky said that there wasn't any self-evident reason why studying chess-playing computers would tell you anything about humans. He was correct: we never did learn anything significant about humans from chess computers.
> As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits.
There is a difference between pursuing a reasonable line of inquiry that then fails and pursuing one that you know, or ought to know, is flawed. If someone had pointed out the problems with CFGs at the outset, it would have been foolish to pursue them, just as it is foolish to ignore the Moro problem now.
> There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set
I can't imagine what those challenges are. I don't remember the details, but I believe Moro made simple, systematic grammar changes. Your Hop is in the same vein.
> where the control condition is a real language
Why does the control need to be a real language? Moro did not use a real language control on humans. (Edit: Because you want to use pre-trained models?).
> More generally our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully
Again, you've invented an abstract problem to study that has no bearing on the problem that Chomsky has described. Moro showed that humans struggle with certain synthetic grammar constructions. Chomsky noted that LLMs do not share this important feature. You are now taking this concrete observation about humans and turning it into the abstract study of "impossible languages".
> It's not as simple as hierarchical vs. linear
There are different aspects of language, but there is a characteristic feature missing from LLMs that makes them unsuitable as models of human language. It doesn't make any sense for a linguist to care about LLMs unless you provide justification for why they would learn anything about the human language faculty from LLMs despite that fundamental difference.
> I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences
Even if this were true, we would be back to "no evidence" rather than "evidence against". But it is very unlikely that the Moro languages are any more difficult for LLMs to learn because, as I said earlier, they are computationally very simple, simpler than hierarchical languages.