Comment by immibis
1 year ago
This architecture, on the surface, seems to preclude the basic function of recognizing sequences of tokens. At the very least, it seems like it should suffer from something like the pumping lemma: if [the ][cat ][is ][black ] results in the output getting close to a certain vector, shouldn't [the ][cat ][is ][black ][the ][cat ][is ][black ][the ][cat ][is ][black ] get even closer to that vector, and nowhere close to a "why did you just repeat the same sentence three times" vector? Without non-linear mixing between the input token and the hidden state, there will be a lot of linear similarity between similar token sequences...
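Here's a toy numerical sketch of what I mean. Everything in it is made up for illustration (a per-channel decay, random embeddings, not any particular model's actual update rule); the only thing that matters is that the state update is purely linear in both the hidden state and the token:

    import numpy as np

    rng = np.random.default_rng(0)
    d_state, d_token = 32, 16

    # made-up parameters: per-channel decay in (0, 1) and a random input projection
    a = rng.uniform(0.8, 0.99, size=d_state)
    W = rng.standard_normal((d_state, d_token)) / np.sqrt(d_token)
    emb = {w: rng.standard_normal(d_token) for w in ["the", "cat", "is", "black"]}

    def final_state(words):
        h = np.zeros(d_state)              # hidden state starts at all zeros
        for w in words:
            h = a * h + W @ emb[w]         # purely linear update, no non-linear mixing
        return h

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    once = final_state(["the", "cat", "is", "black"])
    thrice = final_state(["the", "cat", "is", "black"] * 3)

    # Because the update is linear, thrice equals (1 + a**4 + a**8) * once channel
    # by channel, so the two states point in nearly the same direction.
    print(cosine(once, thrice))            # very close to 1

The repeated sentence lands almost on top of the single sentence in state space, nowhere near some region that would flag "this was said three times".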
Counterpoint: the hidden state at the beginning of ([the][cat][is][black]) x 3 is (probably) initialized to all zeros, but after seeing those first 4 tokens, it will not be all zeros. Thus, going into the second repetition of the sentence, the model has a different initial hidden state, and should exhibit different behavior. I think this makes it possible for the model to learn to recognize repeated sequences and avoid your proposed pitfall.
The new hidden state after the first repetition will just be a linear interpolation between zero and what the non-recurrent part of the network outputs. After more repetitions, it will simply be closer to what that part outputs, not something qualitatively new.
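Continuing the toy sketch from above (same made-up linear update): the state entering the second repetition is indeed non-zero, but each further repetition just moves it further along the same path toward a fixed point, i.e. the interpolation between zero and the network's output described above:

    # limit of repeating the sentence forever: h* = once / (1 - a**4), channel-wise
    h_star = once / (1 - a**4)

    h = np.zeros(d_state)
    for rep in range(1, 6):
        for w in ["the", "cat", "is", "black"]:
            h = a * h + W @ emb[w]
        # both numbers head toward 1: same direction, with the magnitude filling in
        print(rep, cosine(h, h_star), np.linalg.norm(h) / np.linalg.norm(h_star))

So the repetition count is only encoded as "how far along the path to the fixed point the state has travelled", which seems like a much weaker signal than a genuinely different state.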