Comment by kingstnap

9 days ago

The reduction is in cached inputs. I've commented about this before, but many labs, except DeepSeek and Xiaomi now, absolutely scam you for cached reads.

You are basically paying through the nose for a few seconds of VRAM residence if you are spending significant money on cache reads.

The very nature of autoregressive language modeling is that every single output token produced "reads" the cache.

So in principle the price floor for a cache hit is the flat cost of 1 output token.

Now in reality it has to be more than that because you are occupying VRAM with the cache that forces out other users. But it can still be really cheap.
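To put rough numbers on the VRAM residence point, here's a back-of-envelope sketch. It assumes a standard transformer with a grouped-query KV cache and prices the cache purely by the share of one GPU's VRAM it occupies. The model shape, the GPU price, and the function names are all made up for illustration, not any lab's real figures.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of the KV cache: one key and one value vector per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def residency_cost_usd(cache_bytes, vram_bytes, gpu_usd_per_hour, seconds):
    """Cost of the share of one GPU's VRAM that the cache occupies for `seconds`."""
    vram_fraction = cache_bytes / vram_bytes
    return vram_fraction * gpu_usd_per_hour * seconds / 3600


# Hypothetical model shape and prices, chosen only to show the arithmetic.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=100_000)
cost = residency_cost_usd(cache, vram_bytes=80e9, gpu_usd_per_hour=2.0, seconds=10)
print(f"{cache / 1e9:.1f} GB of cache, ${cost:.6f} for 10 s of residence")
```

Swap in your own guesses for the model shape and GPU price; the point is only that the residence cost scales with cache size times time held.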

No one is producing one output token though.

And using up GPUs for that cache is a pretty big opportunity cost. I highly doubt it's done in VRAM. That would be insane for the one-hour caches.

So it's memory + the time it takes to unload/load into VRAM + the extra cost per output token.
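Those three terms can be sketched as a toy cost model. Everything here is an assumption for illustration: the function name, the idea that the cache sits in cheaper off-GPU storage, that the GPU is billed for the full transfer time, and every price in the example call.

```python
def tiered_cache_hit_cost(
    cache_gb,
    hold_seconds,
    storage_usd_per_gb_hour,
    link_gb_per_s,
    gpu_usd_per_hour,
    output_tokens,
    extra_usd_per_output_token,
):
    """Split a cache hit into storage, transfer, and per-output-token costs."""
    # Memory: holding the cache in off-GPU storage for the cache lifetime.
    storage = cache_gb * storage_usd_per_gb_hour * hold_seconds / 3600
    # Load: GPU time spent waiting while the cache is copied into VRAM.
    load_seconds = cache_gb / link_gb_per_s
    transfer = load_seconds * gpu_usd_per_hour / 3600
    # Decode: extra cost each output token pays for attending over the cache.
    decode = output_tokens * extra_usd_per_output_token
    return {
        "storage": storage,
        "transfer": transfer,
        "decode": decode,
        "total": storage + transfer + decode,
    }


# Hypothetical numbers only: a 13 GB cache held for one hour, then one hit.
costs = tiered_cache_hit_cost(
    cache_gb=13,
    hold_seconds=3600,
    storage_usd_per_gb_hour=0.001,
    link_gb_per_s=25,
    gpu_usd_per_hour=2.0,
    output_tokens=500,
    extra_usd_per_output_token=1e-7,
)
print(costs)
```

Which term dominates depends entirely on the numbers you plug in, which is kind of the whole argument.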

Is it a scam? Idk