Comment by imiric
14 hours ago
> By the time you get to day two, each turn costs tens of thousands of input tokens
This behavior surprised me when I started using LLMs, since it's so counterintuitive.
Why does every interaction require submitting and processing all data in the current session up until that point? Surely there must be a way for the context to be stored server-side, and referenced and augmented by each subsequent interaction.

Could this data be compressed in a way that keeps the most important bits and garbage-collects everything else? Could there be different compression techniques depending on the type of conversation, similar to the domain-specific memories and episodic memory mentioned in the article? Could "snapshots" be supported, so that the user can explore branching paths in the session history?

Some of this is possible by manually managing context, but it's too cumbersome.
Why are all these relatively simple engineering problems still unsolved?
It's not unsolved, at least not the first part of your question. In fact, it's a feature offered by all the main LLM providers (rough example below the links):
- https://platform.openai.com/docs/guides/prompt-caching
- https://platform.claude.com/docs/en/build-with-claude/prompt...
- https://ai.google.dev/gemini-api/docs/caching
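To make that concrete, here's a minimal sketch of the explicit style Anthropic uses (OpenAI's and Gemini's caching is largely automatic): you mark the large, stable prefix of the request with a cache breakpoint, and subsequent requests that share that prefix read it from the cache instead of reprocessing it at full price. The model name and the `LONG_SYSTEM_PROMPT` placeholder are illustrative assumptions, not something from this thread.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the big, stable part of the context (docs, tools, memory)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # everything up to this breakpoint can be reused from the server-side cache
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed since yesterday?"}],
)

# cache_creation_input_tokens on the first call, cache_read_input_tokens on later ones
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

One caveat: the ephemeral cache only lives for a short window (a few minutes by default), so it mainly helps back-to-back turns in an active session.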
Ah, that's good to know, thanks.
But then why is there compounding token usage in the article's trivial solution? Is it just a matter of using the cache correctly?
Cached tokens are cheaper (roughly a 90% discount) but not free.
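To put rough numbers on that: assuming a flat 2,000 tokens added per turn and a ~90% read discount (both made-up figures for illustration; real pricing varies by provider, and cache writes can cost a bit more than uncached input), the cumulative input bill still grows quadratically with the number of turns:

```python
TURN_TOKENS = 2_000    # assumed tokens added per turn (prompt + response)
CACHE_DISCOUNT = 0.10  # cached input billed at ~10% of the normal rate (assumed)

def cumulative_input_tokens(turns: int, discount: float = 1.0) -> float:
    """Billed-equivalent input tokens after `turns` turns of a growing context."""
    total = 0.0
    for t in range(1, turns + 1):
        cached = (t - 1) * TURN_TOKENS  # history already seen: billed at the cache-read rate
        fresh = TURN_TOKENS             # the new turn: billed at full price
        total += cached * discount + fresh
    return total

print(cumulative_input_tokens(50))                  # ~2.55M without caching
print(cumulative_input_tokens(50, CACHE_DISCOUNT))  # ~345k with caching
```

So caching knocks off most of the cost but doesn't change the shape of the curve, which is why the compounding the parent comment asks about still shows up.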
dumb question, but is prompt caching available to Claude Code … ?
If you're using the API, yes. If you have a subscription, you don't care, as you aren't billed per prompt (you just have a limit).