Comment by brookst
6 days ago
But that’s not accurate. There are all sorts of tricks around the KV cache where different users will have the same first N tokens because they share system prompts, caching of entire inputs/outputs when the context and user data are identical, and more.
Not sure if you were just joking or really believe that, but for other people's sake, it’s wildly wrong.
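To make the first trick concrete, here is a minimal sketch of prefix KV-cache reuse, assuming a toy trie keyed on token IDs. The names (TrieNode, PrefixCache, kv_block) are hypothetical rather than any real serving framework's API, though vLLM's "automatic prefix caching" works on the same principle:

    from dataclasses import dataclass, field

    @dataclass
    class TrieNode:
        children: dict = field(default_factory=dict)  # token_id -> TrieNode
        kv_block: object = None  # cached K/V block for the prefix ending here

    class PrefixCache:
        def __init__(self):
            self.root = TrieNode()

        def longest_cached_prefix(self, tokens):
            # Walk the trie; the model only has to prefill the uncached suffix.
            node, blocks, n = self.root, [], 0
            for t in tokens:
                child = node.children.get(t)
                if child is None or child.kv_block is None:
                    break
                blocks.append(child.kv_block)
                node, n = child, n + 1
            return n, blocks

        def insert(self, tokens, kv_blocks):
            node = self.root
            for t, kv in zip(tokens, kv_blocks):
                node = node.children.setdefault(t, TrieNode())
                node.kv_block = kv

    cache = PrefixCache()
    system_prompt = [101, 7592, 2088]  # token IDs of a shared system prompt
    cache.insert(system_prompt, ["kv0", "kv1", "kv2"])
    n, blocks = cache.longest_cached_prefix(system_prompt + [999])
    # n == 3: the second request only pays prefill for the one new token

Two requests that share a long system prompt only pay prefill for the part that differs, which is exactly why identical prefixes matter at scale.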
Really? So the system recognises that someone asked the same question and serves the same answer? And who on earth shares the exact same context?
I mean, I get the idea, but it sounds so incredibly rare that it would mean absolutely nothing optimisation-wise.
Yes. It is not incredibly rare; it's incredibly common. A huge percentage of queries to retail LLMs are things like "hello" and "what can you do", with static system prompts that make the total context identical.
It's worth maybe a 3% reduction in GPU usage. So call it half a billion dollars a year or so, for a medium-to-large service.
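The second kind of caching is even simpler: a toy exact-match response cache, assuming requests are canonicalised to (model, system prompt, user message) and that serving a previously generated reply is acceptable. Here generate is a stand-in for the real model call, and in production the dict would be something like Redis with a TTL:

    import hashlib

    _cache = {}  # key -> completion text

    def _key(model, system, user):
        # Byte-identical context is the only safe cache key: any variation
        # (a user name in the prompt, a timestamp) must be a miss.
        blob = "\x00".join((model, system, user)).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def cached_completion(model, system, user, generate):
        k = _key(model, system, user)
        if k not in _cache:
            _cache[k] = generate(model, system, user)  # fall through to the model
        return _cache[k]

At scale, the hit rate on greetings and boilerplate queries alone is what makes a single-digit-percent GPU saving plausible.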
So if 3% is $500M, then annual spend is ~$16.6B. That is medium-sized these days?
Even if that were the case, you wouldn't be wrong. Adding caching and deduplication (and clever routing and sharding, and ...) on top of timesharing doesn't somehow make it not timesharing anymore. The core observation about the raw numbers still applies.