Comment by NitpickLawyer
15 hours ago
Unless I'm parsing your reply very badly, I see no world in which anything dealing with HTTP would be more expensive than dealing with kv cache (loading from "cold" storage, deciding which compute unit to load it into, doing the actual computations for the next call, etc).
No, that’s not the issue. What people fail to understand is that every request - eg every message you send, but also tool call responses - require the entire conversation history to be sent, and the LLM providers need to reprocess things.
The attention part of LLMs (that is, for every token, how much their attention is to all other tokens) is cached in a KV cache.
You can imagine that with large context windows, the overhead becomes enormous (attention has exponential complexity).