Comment by visiondude

8 months ago

always seemed to me that efficient caching strategies could greatly reduce costs… wonder if they cooked up something new

xmprt 8 months ago

How are LLMs cached? Every prompt would be different so it's not clear how that would work. Unless you're talking about caching the model weights...

hadlock 8 months ago
I've asked it a question not in its dataset three different ways and I see the same three sentences in the response, word for word, which could imply it's caching the core answer. I hadn't seen this behavior before this past week.
- beering 8 months ago
  
  Isn’t the simpler explanation that if you ask the same question, there’s a chance you would get the same answer?
  In this case you didn’t even get the same answer, you only happened to have one sentence in the answer match.
HugoDias 8 months ago
This document explains the process very well. It’s a good read: https://platform.openai.com/docs/guides/prompt-caching
- xmprt 8 months ago
  
  That link explains how OpenAI uses it, but doesn't really walk through how it's any faster. I thought the whole point of transformers was that inference speed no longer depended on prompt length. So how does caching the prompt help reduce latency if the outputs aren't being cached?
  > Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt
  
- catlifeonmars 8 months ago
  
  > OpenAI routes API requests to servers that recently processed the same prompt,
  My mind immediately goes to rowhammer for some reason.
  At the very least this opens up the possibility of some targeted denial of service.
  
amanda99 8 months ago
You would use a KV cache to cache a significant chunk of the inference work.
- xmprt 8 months ago
  
  Using KV in the caching context is a bit confusing because it usually means key-value in the storage sense of the word (like Redis), but for LLMs, it means the key and value tensors. So IIUC, the cache stores the K and V tensors already computed for the prompt tokens, so for each new token the model only has to compute that token's own Q, K, and V and run attention against the cached entries.
- biophysboy 8 months ago
  
  Do you mean that they provide the same answer to verbatim-equivalent questions, and pull the answer out of storage instead of recalculating each time? I've always wondered if they did this.
  
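To make the key/value tensor caching discussed in this subthread concrete (as opposed to storing whole answers), here is a minimal single-head sketch in plain Python. Everything in it is an assumption for illustration: the names `KVCache`, `step`, and `full_recompute`, the random weights, and the tiny dimension are made up, and real inference stacks work per layer and per head on GPU tensors. It only shows that attention for the newest token gives the same result whether the earlier tokens' K and V are recomputed or read back from a cache.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Hypothetical cache holding one key and one value vector per seen token."""

    def __init__(self):
        self.keys = []
        self.values = []


def step(x, Wq, Wk, Wv, cache):
    """Process one new token: compute only its own Q, K, V and reuse the rest."""
    cache.keys.append(matvec(Wk, x))
    cache.values.append(matvec(Wv, x))
    q = matvec(Wq, x)
    return attend(q, cache.keys, cache.values)


def full_recompute(xs, Wq, Wk, Wv):
    """Baseline without a cache: redo K and V for every token each time."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    q = matvec(Wq, xs[-1])
    return attend(q, keys, values)


rng = random.Random(0)
d = 4  # toy embedding size, chosen arbitrarily


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

cache = KVCache()
for x in tokens:
    cached_out = step(x, Wq, Wk, Wv, cache)  # output for the newest token

baseline_out = full_recompute(tokens, Wq, Wk, Wv)
print(all(abs(a - b) < 1e-9 for a, b in zip(cached_out, baseline_out)))
```

With the cache, each step does a constant number of projections for the new token; without it, the K and V projections for the whole sequence are redone every time.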
koakuma-chan 8 months ago

A lot of the prompt is always the same: the instructions, the context, the codebase (if you are coding), etc.
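A rough sketch of how a shared prompt prefix could be reused across requests, assuming a toy design where prefixes are hashed in fixed-size blocks. `PrefixCache`, `tokens_to_compute`, the whitespace "tokenizer", and the block size of 4 are all hypothetical and not how any particular provider implements it; a real system would store the computed attention state under each key rather than a placeholder.

```python
import hashlib


class PrefixCache:
    """Toy prompt-prefix cache keyed by a hash of the first N tokens."""

    def __init__(self, block_size=4):
        self.block_size = block_size  # assumed granularity, purely illustrative
        self.store = {}

    def _key(self, tokens):
        joined = "\x1f".join(tokens)  # unit separator keeps token boundaries
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def longest_cached_prefix(self, tokens):
        """Return the length of the longest block-aligned prefix already seen."""
        n = (len(tokens) // self.block_size) * self.block_size
        while n > 0:
            if self._key(tokens[:n]) in self.store:
                return n
            n -= self.block_size
        return 0

    def insert(self, tokens):
        """Record every block-aligned prefix of this prompt."""
        for n in range(self.block_size, len(tokens) + 1, self.block_size):
            # The stored value stands in for the real per-prefix model state.
            self.store.setdefault(self._key(tokens[:n]), n)


def tokens_to_compute(cache, prompt):
    """Count how many tokens still need processing after a prefix lookup."""
    tokens = prompt.split()  # stand-in for a real tokenizer
    hit = cache.longest_cached_prefix(tokens)
    cache.insert(tokens)
    return len(tokens) - hit


cache = PrefixCache(block_size=4)
system = "you are a helpful coding assistant answer briefly"  # 8 tokens

first_cost = tokens_to_compute(cache, system + " how do I reverse a list")
second_cost = tokens_to_compute(cache, system + " what is a closure")

print(first_cost)   # 14: nothing cached yet, the whole prompt is processed
print(second_cost)  # 4: the 8-token shared prefix is found in the cache
```

The lookup only helps when the repeated material sits at the start of the prompt, which is why instructions and context are usually placed before the part that changes.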
tasuki 8 months ago

> Every prompt would be different
No? E.g. "how to cook pasta" is probably asked a lot.