Comment by anamax
2 days ago
> My suspicion is that we are in Ptolemaic state as far as GPT like models are concerned. We will eventually understand them better once we figure out what's the better coordinate system to think about their dynamics in.
Most deep learning systems are learned matrices that are multiplied by "problem-instance" data matrices to produce a prediction matrix. The time to do said matrix-multiplication is data-independent (assuming that the time to do multiply-adds is data-independent).
If you multiply both sides of that equation by the inverse of the learned matrix, finding the prediction matrix becomes a solving problem, and the time to solve is data-dependent.
Interestingly enough, that time is roughly proportional to how difficult the problem is for the given data.
Perhaps more interesting is that the inverse matrix seems to have row artifacts that look like things in the training data.
These observations are due to Tsvi Achler.
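To make the contrast concrete, here is a minimal sketch assuming a single learned weight matrix W and no nonlinearity: the feed-forward pass y = W @ x costs the same for every input, while recovering y by solving the rearranged system W^-1 y = x with an iterative solver takes a number of iterations that depends on the data. W, the GMRES solver, and the random inputs are illustrative assumptions, not Achler's actual formulation.

```python
import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(0)
n = 200
# Well-conditioned stand-in for a "learned" matrix (purely illustrative).
W = np.eye(n) + 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)
W_inv = np.linalg.inv(W)

def predict_feedforward(x):
    """Fixed cost: one matrix-vector product, independent of the input."""
    return W @ x

def predict_by_solving(x):
    """Data-dependent cost: solve W^-1 y = x iteratively for y."""
    residual_log = []                       # one entry per solver iteration
    y, info = gmres(W_inv, x, callback=residual_log.append)
    return y, len(residual_log)

x = rng.normal(size=n)                      # one problem instance
y_ff = predict_feedforward(x)
y_solved, iters = predict_by_solving(x)
print("solver iterations for this input:", iters)   # varies with the input
print("max abs difference between the two answers:",
      np.max(np.abs(y_ff - y_solved)))
```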
Neural nets are quite a bit more than matrix multiplications, at least in their current representation.
There are layers upon layers of nonlinearity, whether softmax or sigmoid. In the neural tangent kernel view it does linearize, though.
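Concretely, the tangent-kernel view refers to the first-order expansion of the network in its parameters around initialization; the notation below is the standard NTK one, not anything specific to this thread:

```latex
% Linearization of the network f(x;\theta) around initial parameters \theta_0:
% linear in the parameters \theta, still nonlinear in the input x.
f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0)

% The associated neural tangent kernel:
K(x, x') = \nabla_\theta f(x;\theta_0)^{\top} \, \nabla_\theta f(x';\theta_0)
```

In the infinite-width limit this kernel stays essentially fixed during training, which is what makes the linearized view useful.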