Comment by imiric
14 hours ago
> By the time you get to day two, each turn costs tens of thousands of input tokens
This behavior surprised me when I started using LLMs, since it's so counterintuitive.
Why does every interaction require submitting and processing all data in the current session up until that point? Surely there must be a way for the context to be stored server-side, and referenced and augmented by each subsequent interaction.

Could this data be compressed in a way that keeps the most important bits and garbage-collects everything else? Could there be different compression techniques depending on the type of conversation, similar to the domain-specific memories and episodic memory mentioned in the article? Could "snapshots" be supported, so that the user can explore branching paths in the session history?

Some of this is possible by manually managing context, but it's too cumbersome.
Why are all these relatively simple engineering problems still unsolved?
It's not unsolved, at least not the first part of your question. In fact, it's a feature offered by all the main LLM providers (rough example below the links):
- https://platform.openai.com/docs/guides/prompt-caching
- https://platform.claude.com/docs/en/build-with-claude/prompt...
- https://ai.google.dev/gemini-api/docs/caching
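To make that concrete, here's a minimal sketch of the explicit style Anthropic uses (OpenAI's and Gemini's caching is largely automatic): you mark the large, stable prefix of the request with a cache breakpoint, and subsequent requests that share that prefix read it from the cache instead of reprocessing it at full price. The model name and the `LONG_SYSTEM_PROMPT` placeholder are illustrative assumptions, not something from this thread.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the big, stable part of the context (docs, tools, memory)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # everything up to this breakpoint can be reused from the server-side cache
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed since yesterday?"}],
)

# cache_creation_input_tokens on the first call, cache_read_input_tokens on later ones
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

One caveat: the ephemeral cache only lives for a short window (a few minutes by default), so it mainly helps back-to-back turns in an active session.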
Ah, that's good to know, thanks.
But then why is there compounding token usage in the article's trivial solution? Is it just a matter of using the cache correctly?
Cached tokens are cheaper (roughly a 90% discount) but not free.
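To put rough numbers on that: assuming a flat 2,000 tokens added per turn and a ~90% read discount (both made-up figures for illustration; real pricing varies by provider, and cache writes can cost a bit more than uncached input), the cumulative input bill still grows quadratically with the number of turns:

```python
TURN_TOKENS = 2_000    # assumed tokens added per turn (prompt + response)
CACHE_DISCOUNT = 0.10  # cached input billed at ~10% of the normal rate (assumed)

def cumulative_input_tokens(turns: int, discount: float = 1.0) -> float:
    """Billed-equivalent input tokens after `turns` turns of a growing context."""
    total = 0.0
    for t in range(1, turns + 1):
        cached = (t - 1) * TURN_TOKENS  # history already seen: billed at the cache-read rate
        fresh = TURN_TOKENS             # the new turn: billed at full price
        total += cached * discount + fresh
    return total

print(cumulative_input_tokens(50))                  # ~2.55M without caching
print(cumulative_input_tokens(50, CACHE_DISCOUNT))  # ~345k with caching
```

So caching knocks off most of the cost but doesn't change the shape of the curve, which is why the compounding the parent comment asks about still shows up.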
dumb question, but is prompt caching available to Claude Code … ?
If you're using the API, yes. If you have a subscription, you don't care, as you aren't billed per prompt (you just have a limit).