Comment by d4rkp4ttern
2 months ago
For me, one of the most interesting aspects is how compaction works. It turns out compaction still preserves the full original pre-compaction conversation in the session JSONL file, and those messages are marked as "not to be sent to the API". This means that even after compaction, if you think something was lost, you can tell CC to "look in the session log files to find details about what we did with XYZ". I knew this before the leak, since it can be seen from the session logs. Some more details:
The full conversation is preserved in the JSONL file, and messages are filtered before being sent to the API.
Key mechanisms:
1. JSONL is append-only: old pre-compaction messages are never deleted. New messages (boundary marker, summary, attachments) are appended after compaction.
2. Messages have flags controlling API visibility:
- isCompactSummary: true — marks the AI-generated summary message
- isVisibleInTranscriptOnly: true — prevents a message from being sent to the API
- isMeta — another filter for non-API messages
- getMessagesAfterCompactBoundary() returns only post-compaction messages for API calls
3. After compaction, the API sees only:
- The compact boundary marker
- The summary message
- Attachments (file refs, plan, skills)
- Any new messages after compaction
4. Three compaction types exist:
- Full compaction — API summarizes all old messages
- Session memory compaction — uses extracted session memory as summary (cheaper)
- Microcompaction — clears old tool result content when cache is cold (>1h idle)
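A minimal sketch of the filtering described above, written in Python rather than taken from CC's own code. The flag names isCompactSummary, isVisibleInTranscriptOnly and isMeta come from the list; the "compact_boundary" record type, the record layout, and all function names are assumptions made up for illustration.

```python
import json

def load_session(jsonl_text):
    """Parse an append-only JSONL log: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def messages_for_api(records):
    """Return only the records the API would see after the last compaction."""
    # Assumption: the boundary marker is a record whose type is "compact_boundary".
    last_boundary = -1
    for i, rec in enumerate(records):
        if rec.get("type") == "compact_boundary":
            last_boundary = i
    # With no boundary, the whole log is live; otherwise keep boundary onward.
    tail = records if last_boundary < 0 else records[last_boundary:]
    # Drop transcript-only and meta records, as the flags in the list suggest.
    return [
        rec for rec in tail
        if not rec.get("isVisibleInTranscriptOnly") and not rec.get("isMeta")
    ]

def search_full_log(records, needle):
    """Search every record, including pre-compaction ones still on disk."""
    return [rec for rec in records if needle in json.dumps(rec)]

# Hypothetical session log; the field names here are illustrative only.
log = "\n".join(json.dumps(r) for r in [
    {"type": "user", "text": "refactor XYZ parser"},
    {"type": "assistant", "text": "done, renamed parse_xyz"},
    {"type": "compact_boundary"},
    {"type": "user", "text": "summary of earlier work", "isCompactSummary": True},
    {"type": "system", "text": "ui-only note", "isVisibleInTranscriptOnly": True},
    {"type": "user", "text": "next task"},
])

records = load_session(log)
api_view = messages_for_api(records)      # boundary, summary, new message
hits = search_full_log(records, "parse_xyz")  # still finds the old message
```

The point of the sketch is that the API view and the on-disk log are two different things: the old "parse_xyz" message is gone from api_view but still searchable in the full record list.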
What is microcompaction? I didn’t realize there was anything time-based in CC. When I go eat dinner and come back, has it compacted while I was gone?
I dug into this more. It's disabled by default, and it's a cost/token-usage optimization.
[flagged]
> it's basically a cost optimization masquerading as a feature
Cost optimization in the user's favor.
Remember that every time you send a new message to the LLM, you are actually resending the entire conversation with the new message appended.
Remember that LLMs are fixed functions; the only variable is the context input (and temperature, sure).
Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into today's 100k-1M token context windows.
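To make the quadratic growth concrete, here is a small worked calculation. The 1,000 tokens per turn is an arbitrary assumption and the function name is made up; only the shape of the growth matters.

```python
def total_input_tokens(turns, tokens_per_turn):
    """Input tokens sent if every turn resends the whole conversation uncached."""
    # Turn k resends k * tokens_per_turn tokens, so the total is
    # tokens_per_turn * turns * (turns + 1) / 2, i.e. quadratic in turns.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

ten = total_input_tokens(10, 1000)       # 55,000
hundred = total_input_tokens(100, 1000)  # 5,050,000
```

Ten times as many turns costs roughly ninety times as many input tokens, not ten times.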
To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation/context. But they're not going to keep that GPU cache warm for you forever, so it'll time out after some inactivity.
So microcompaction-on-idle exists to soften the token-consumption blow: after you've stepped away for lunch, the AI provider has flushed your context cache, and you basically have to spend tokens to restart your conversation from scratch.
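A rough sketch of that trade-off, with made-up numbers. The 10% rate for cached tokens, the context sizes, and the function name are all assumptions for illustration, not any provider's actual pricing; costs are in units of "one full-price input token".

```python
def next_turn_cost(context_tokens, new_tokens, cache_warm, cached_discount=0.1):
    """Cost of one turn, in full-price input-token units."""
    if cache_warm:
        # Assumed: the cached prefix is billed at a discount, the delta in full.
        return context_tokens * cached_discount + new_tokens
    # Cold cache: the whole context is processed (and billed) from scratch.
    return context_tokens + new_tokens

context = 200_000           # assumed conversation size
old_tool_results = 150_000  # assumed share that is stale tool output
new = 500                   # assumed size of the next message

warm = next_turn_cost(context, new, cache_warm=True)
cold = next_turn_cost(context, new, cache_warm=False)
# Microcompaction clears old tool results before paying the cold-cache price.
cold_microcompacted = next_turn_cost(context - old_tool_results, new, cache_warm=False)
```

Under these assumptions the cold restart costs about ten times the warm turn, and clearing stale tool results first cuts the cold restart to about a quarter of that.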