Comment by ActorNightly
11 days ago
In theory, you could have a large enough Markov chain that mimics an LLM; you would just need it to be exponentially larger in width.
After all, it's just matrix multiplies from start to finish.
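A minimal toy sketch of that claim, in Python with NumPy (the state count, matrix values, and seed are all made up for illustration; a chain truly matching an LLM would need a state per possible context):

    import numpy as np

    # Toy Markov chain whose states stand in for whole contexts.
    # One sampling step is a single matrix multiply: a one-hot state
    # vector times a row-stochastic transition matrix yields the
    # next-state distribution, much as an LLM's forward pass yields
    # a next-token distribution.
    rng = np.random.default_rng(0)

    n_states = 4                       # tiny; a chain matching an LLM
                                       # would need exponentially many states
    T = rng.random((n_states, n_states))
    T /= T.sum(axis=1, keepdims=True)  # rows sum to 1: stochastic matrix

    state = np.zeros(n_states)
    state[0] = 1.0                     # one-hot: "currently in context 0"

    next_dist = state @ T              # one matmul = one sampling step
    print(next_dist)                   # distribution over next states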
A lot of the other data operations (like normalization) can be represented as matrix multiplies, just less efficiently, in the same way that a transformer can be represented, inefficiently, as a set of fully connected deep layers.
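To make the normalization point concrete: the mean-subtraction half of normalization really is a fixed matrix multiply (the centering matrix), though the variance-scaling step of e.g. LayerNorm depends on the input, so it is not a single constant matrix. A sketch with an arbitrary size n:

    import numpy as np

    # Mean-centering an n-vector equals multiplying by the centering
    # matrix C = I - (1/n) * ones((n, n)): an O(n^2) matmul doing what
    # direct mean subtraction does in O(n), i.e. the same operation,
    # "just less efficiently".
    n = 5
    x = np.arange(n, dtype=float)

    C = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    assert np.allclose(C @ x, x - x.mean())
    print(C @ x)                            # mean-subtracted x, via one matmul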
True. But the considerations re: practicability are not to be ignored.