Comment by AceJohnny2

2 months ago

> it's basically a cost optimization masquerading as a feature

Cost optimization in the user's favor.

Remember that every time you send a new message to the LLM, you are actually sending the entire conversation again with that added last message to the LLM.

Remember that LLMs are fixed functions, the only variable is the context input (and temperature, sure).

Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.

To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation/context. But they're not going to keep that GPU cache warm for you forever, so it'll time out after some inactivity.

So the microcompaction-on-idle happens to soften the token consumption blow after you've stepped away for lunch, your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch.

0 comments

AceJohnny2

No comments yet

Contribute on Hacker News ↗