Comment by vlovich123
13 days ago
It’s quadratic if you implement the transformer naiively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.
13 days ago
It’s quadratic if you implement the transformer naiively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.
This is false. The const of producing a single token is linear but the cost of producing an entire sequence of length N is O(N^2) still (which is always what we meant when we talked about quadratic cost not the cost of a single token).