← Back to context

Comment by bontaq

18 hours ago

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic

  • If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?

    If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?

    • > I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that.

      Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.

    • Yes, that is indeed O(N^2). Which, by the way, is not exponential.

      Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.

      1 reply →

    • What we would call O(n^2) in your rewriting message history would be the case where you have an empty database and you need to populate it with a certain message history. The individual operations would take 1, 2, 3, .. n steps, so (1/2)*n^2 in total, so O(n^2).

      This is the operation that is basically done for each message in an LLM chat in the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on server-side (in KV cache). KV caches can be very large, e.g. tens of gigabytes.