Comment by choppaface

8 hours ago

Or maybe they don’t actually cache (fully) but lie and just don’t charge the user right now. At least half the users, who are probably also the ones sending the most similar tokens/prompts, wouldn’t really notice the difference in latency (or care).

2 comments


londons_explore 7 hours ago

If it actually cost that much RAM, they would almost certainly add extra things to the API to manage cache lifetime, e.g. a 'please cache this for X minutes' flag, or a setting for a single-reuse cache (the most common use case).
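To make the idea concrete, here is a rough sketch of what those two knobs could look like on the server side. Everything in it is hypothetical (the `PrefixCache` class, `ttl_seconds`, `single_use`); it is not any provider's actual API, just a toy in-memory cache with a per-entry lifetime and an optional evict-after-first-hit mode.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class _Entry:
    value: Any
    expires_at: float   # clock value after which the entry is stale
    single_use: bool    # evict after the first successful read


class PrefixCache:
    """Toy cache keyed by prompt prefix (hypothetical, for illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, _Entry] = {}

    def put(self, key: str, value: Any, ttl_seconds: float,
            single_use: bool = False) -> None:
        # "Please cache this for X seconds", optionally for one re-use only.
        expires_at = self._clock() + ttl_seconds
        self._entries[key] = _Entry(value, expires_at, single_use)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            # Lifetime is over: free the memory and report a miss.
            del self._entries[key]
            return None
        if entry.single_use:
            # Single re-use cache: drop the entry as soon as it is read once.
            del self._entries[key]
        return entry.value
```

With a single-use entry the memory is released on the first hit instead of sitting around until the TTL runs out, which is the RAM saving the flag would be for.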

cyanydeez 6 hours ago

https://platform.claude.com/docs/en/build-with-claude/prompt...
suggests they can cache outside the GPU.