Comment by vlovich123

3 months ago

It’s quadratic if you implement the transformer naiively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.

1 comment

vlovich123

hexomancer 3 months ago

This is false. The const of producing a single token is linear but the cost of producing an entire sequence of length N is O(N^2) still (which is always what we meant when we talked about quadratic cost not the cost of a single token).