Comment by d4rkp4ttern

2 months ago

For me, one of the most interesting aspects is how compaction works. It turns out compaction still preserves the full original pre-compaction conversation in the session JSONL file; the old messages are simply marked as "not to be sent to the API". This means that even after compaction, if you think something was lost, you can tell CC to "look in the session log files to find details about what we did with XYZ". I knew this before the leak since it can be seen from the session logs. Some more details:

  The full conversation is preserved in the JSONL file, and messages
  are filtered before being sent to the API.

  Key mechanisms:

  1. JSONL is append-only — old pre-compaction messages are never deleted. New messages (boundary
  marker, summary, attachments) are appended after compaction.
  2. Messages have flags controlling API visibility:
    - isCompactSummary: true — marks the AI-generated summary message
    - isVisibleInTranscriptOnly: true — prevents a message from being sent to the API
    - isMeta — another filter for non-API messages
    - getMessagesAfterCompactBoundary() returns only post-compaction messages for API calls
  3. After compaction, the API sees only:
    - The compact boundary marker
    - The summary message
    - Attachments (file refs, plan, skills)
    - Any new messages after compaction
  4. Three compaction types exist:
    - Full compaction — API summarizes all old messages
    - Session memory compaction — uses extracted session memory as summary (cheaper)
    - Microcompaction — clears old tool result content when cache is cold (>1h idle)
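
To make the filtering concrete, here is a minimal sketch of how an append-only log could be reduced to what the API sees. It is not the actual CC code: the flag names isCompactSummary, isVisibleInTranscriptOnly and isMeta come from the notes above, but the isCompactBoundary field, the function name and the one-JSON-object-per-line layout are my assumptions for illustration.

```python
import json


def messages_for_api(jsonl_lines):
    """Return the subset of logged messages that would be sent to the API.

    Hypothetical schema: every line is one JSON object, and a compact
    boundary marker carries "isCompactBoundary": true (assumed name).
    """
    messages = [json.loads(line) for line in jsonl_lines if line.strip()]

    # Start from the most recent compact boundary, if there is one.
    # Everything before it stays in the file but is skipped here.
    start = 0
    for i, msg in enumerate(messages):
        if msg.get("isCompactBoundary"):
            start = i

    # Drop messages flagged as transcript-only or meta.
    return [
        msg
        for msg in messages[start:]
        if not msg.get("isVisibleInTranscriptOnly") and not msg.get("isMeta")
    ]


log = [
    json.dumps({"id": 1, "text": "old user message"}),
    json.dumps({"id": 2, "text": "old tool output"}),
    json.dumps({"id": 3, "isCompactBoundary": True}),
    json.dumps({"id": 4, "isCompactSummary": True, "text": "summary"}),
    json.dumps({"id": 5, "isVisibleInTranscriptOnly": True}),
    json.dumps({"id": 6, "text": "new user message"}),
]
print([m["id"] for m in messages_for_api(log)])  # [3, 4, 6]
```

The point is that nothing is deleted from the log itself: messages 1 and 2 are still in the file, which is why asking CC to search the session log can recover them.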

What is microcompaction? I didn’t realize there was anything time-based in CC. If I go eat dinner and come back, has it compacted while I was gone?

  • I dug into this more. It's disabled by default, and it's a cost/token-usage optimization.

      The logic is:
    
      1. Anthropic's API has a server-side prompt cache with a 1-hour TTL
      2. When you're actively using a session, each API call reuses the cached prefix — you only pay
      for new tokens
      3. After 1 hour idle, that cache is guaranteed expired
      4. Your next message will re-send and re-process the entire conversation from scratch — every
      token, full price
      5. So if you have 150K tokens of old Grep/Read/Bash outputs sitting in the conversation, you're
      paying to re-ingest all of that even though it's stale context the model probably doesn't need
    
      The microcompact says: "since we're paying full price anyway, let's shrink the bill by clearing
      the bulky stuff."
    
      What's preserved vs lost:
      - The tool_use blocks (what tool was called, with what arguments) — kept
      - The tool_result content (the actual output) — replaced with [Old tool result content cleared]
      - The most recent 5 tool results — kept
    
      So Claude can still see "I ran Grep for foo in src/" but not the 500-line grep output from 2
      hours ago.
    
      Does it affect quality? Yes, somewhat — but the tradeoff is that without it, you're paying
      potentially tens of thousands of tokens to re-ingest stale tool outputs that the model already
      acted on. And remember, if the conversation is long enough, full compaction would have summarized
      those messages anyway.
    
      And critically: this is disabled by default (enabled: false in timeBasedMCConfig.ts:31). It's
      behind a GrowthBook feature flag that Anthropic controls server-side. So unless they've flipped
      it on for your account, it's not happening to you.
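
A rough sketch of the clearing step described above, under stated assumptions: the placeholder string, the 1-hour threshold and the "keep the most recent 5" rule are taken from the explanation, while the message shape (a "type" field with "tool_use" / "tool_result") and the function and parameter names are hypothetical. The real implementation may well differ.

```python
CLEARED = "[Old tool result content cleared]"


def microcompact(messages, idle_seconds, keep_recent=5, idle_threshold=3600):
    """Blank out old tool results once the session has been idle too long.

    Assumed message shape: dicts with a "type" of "tool_use" or
    "tool_result"; tool results carry their output in "content".
    """
    # Cache is presumably still warm: leave the conversation untouched.
    if idle_seconds <= idle_threshold:
        return messages

    result_positions = [
        i for i, m in enumerate(messages) if m.get("type") == "tool_result"
    ]
    # Everything except the last `keep_recent` tool results gets cleared.
    if keep_recent > 0:
        to_clear = set(result_positions[:-keep_recent])
    else:
        to_clear = set(result_positions)

    # Build new dicts so the original messages are not mutated.
    return [
        {**m, "content": CLEARED} if i in to_clear else m
        for i, m in enumerate(messages)
    ]


history = []
for n in range(7):
    history.append({"type": "tool_use", "name": "Grep", "input": {"pattern": f"p{n}"}})
    history.append({"type": "tool_result", "content": f"output {n}"})

compacted = microcompact(history, idle_seconds=2 * 3600)
print([m["content"] for m in compacted if m["type"] == "tool_result"])
```

With seven tool calls and two hours of idle time, the two oldest outputs are replaced by the placeholder and the five newest survive. The tool_use entries are untouched, so the model still sees what was run and with which arguments.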

[flagged]

  • > it's basically a cost optimization masquerading as a feature

    Cost optimization in the user's favor.

    Remember that every time you send a new message to the LLM, you are actually sending the entire conversation again with that added last message to the LLM.

    Remember that LLMs are fixed functions; the only variable is the context input (and temperature, sure).

    Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.

    To solve this, AI providers cache the context on the GPU and charge you full price only for the delta in the conversation/context (cached tokens are typically billed at a steep discount rather than being free). But they're not going to keep that GPU cache warm for you forever, so it times out after some inactivity.

    So the microcompaction-on-idle happens to soften the token consumption blow after you've stepped away for lunch, your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch.
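
To put numbers on the quadratic claim, here is a back-of-the-envelope sketch. The function name and the flat 1,000 tokens per turn are made up for illustration, and the cached case uses the simplification that only the delta is billed; real pricing differs.

```python
def total_input_tokens(turns, tokens_per_turn, cached=False):
    """Sum the input tokens billed over a whole conversation.

    Without caching, every turn re-sends the full context so far.
    With caching (simplified), only the newly added tokens are billed.
    """
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # the conversation grows each turn
        total += tokens_per_turn if cached else context
    return total


naive = total_input_tokens(turns=100, tokens_per_turn=1_000)
with_cache = total_input_tokens(turns=100, tokens_per_turn=1_000, cached=True)
print(naive)       # 5050000
print(with_cache)  # 100000
```

A 100-turn conversation that ends at 100K tokens of context bills roughly 5M input tokens when re-sent naively versus 100K when only the delta counts. After a cache flush, the next message pays for the whole context again, which is the bill that clearing old tool results is meant to shrink.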