Comment by mkaic

1 year ago

Counterpoint: the hidden state at the beginning of ([the][cat][is][black]) x 3 is (probably) initialized to all zeros, but after seeing those first 4 tokens, it will not be all zeros. Thus, going into the second repetition of the sentence, the model has a different initial hidden state, and should exhibit different behavior. I think this makes it possible for the model to learn to recognize repeated sequences and avoid your proposed pitfall.

The hidden state after the first repetition will just be a linear interpolation between the zero initial state and whatever the non-recurrent part of the network outputs. With each additional repetition, it moves closer to the network's output.
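A minimal sketch of this dynamic, under assumed details not in the comment: a scalar hidden state, a leaky-integration recurrence `h = (1 - a) * h + a * f(x)` (a linear interpolation between the previous state and a per-token output), and made-up values for `a` and `f`:

```python
# Hypothetical leaky-integration recurrence: h_t = (1 - a) * h_{t-1} + a * f(x_t).
# The interpolation weight and per-token outputs below are invented for illustration.
a = 0.25  # assumed interpolation weight

# Stand-in per-token outputs f(x) for the four tokens of "the cat is black".
f = {"the": 0.1, "cat": 0.9, "is": -0.3, "black": 0.5}

def run(tokens, h=0.0):
    """Return the hidden state at the start of each repetition of the sentence."""
    states = [h]  # h starts at zero, as in the comment
    for t in tokens:
        h = (1 - a) * h + a * f[t]
        if t == "black":  # end of one repetition of the sentence
            states.append(h)
    return states

sentence = ["the", "cat", "is", "black"]
states = run(sentence * 3)
print(states)
```

The printed states start at 0.0, are nonzero after the first repetition, and creep monotonically toward a fixed point with each further repetition, so the model sees a different initial state before each copy of the sentence, which is exactly what would let it detect the repetition.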