Comment by Roritharr

2 days ago

I've thought about the hijacking of reasoning chains as a potential vector, but never saw a proven implementation in American models since, from my understanding, all major vendors throw out the reasoning tokens between turns.

For Claude, at least, "throw out the reasoning tokens" is only true when a session has been idle for more than an hour, and is new since March.

The basic concept is that for a session active recently, interleaved thinking tokens are already in KV cache, so it's more efficient to keep using them than not! But when resuming an older session where KV cache has been evicted, it's more expensive to restore the thinking tokens, so they're silently dropped from prior turns. It's 2026 and stateful servers are back on the menu!

(https://news.ycombinator.com/item?id=47884517 indicates OpenAI drops reasoning tokens "smartly" at its own election, which is likely a similar performance optimization.)

I've experimented with rules to have Claude Code be explicit about recapping its thinking tokens, including tool choices and approaches chosen and rejected, into actual message output, but this is lossy at best. And sometimes dropping reasoning tokens can give a session "fresh eyes" in a good way.

I just really don't like the lack of control, and it's a reminder of how ephemeral the current landscape is. The Claude giveth, and the Claude taketh away.

  • it's mostly annoying in that you give Opus a big job that should be able to run for hours on end, but instead it tries to stop and checkpoint at the soonest possible moment even though the rest of the work is well specced and ready to go.

    then it waits for the hour and gets dumbed down

  • I think you're confusing two different axes. There is a difference between the cache state and the context state.

    Imagine a conversation with turns X, Y, and Z. When the LLM "reasons" about the next token A it does: P(A | X,Y,Z) and then P(B | X,Y,Z,A), etc. It will eventually produce a result P(D | X,Y,Z,A,B,C). Instead of continuing the context from X,Y,Z,A,B,C it continues it from X,Y,Z so you have P(N | X,Y,Z,D). This is what is meant by dropping the reasoning. This is done to save cache context for the session.

    This is a different thing than preserving the K/V state of P(N | X,Y,Z,D).

    • No, I think the comment you're responding to is actually correct. Look at this quote from the Anthropic blog post again:

      > The design should have been simple: if a session has been idle for more than an hour, we could reduce users’ cost of resuming that session by clearing old thinking sections. Since the request would be a cache miss anyway, we could prune unnecessary messages from the request to reduce the number of uncached tokens sent to the API. We’d then resume sending full reasoning history. To do this we used the clear_thinking_20251015 API header along with keep:1.

      They clearly make the same distinction between the cache and the context. They're saying "we could reduce users’ cost of resuming that session by clearing old thinking sections". They intentionally created a behavior different between cached and uncached requests, specifically they clear thinking sections from the context for requests that miss the cache.
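To make the cache-versus-context distinction concrete, here is a minimal sketch of the pruning behavior the quoted post describes. Everything in it is an assumption for illustration: the `Turn` type, the `build_request` helper, and the reading of `keep` as "retain thinking on the most recent N assistant turns" are hypothetical, not Anthropic's actual implementation or API shape.

```python
from dataclasses import dataclass
from typing import List, Optional

# "idle for more than an hour", per the quoted blog post
IDLE_LIMIT_SECONDS = 3600


@dataclass
class Turn:
    """Hypothetical message record; `thinking` stands in for a thinking block."""
    role: str
    text: str
    thinking: Optional[str] = None


def build_request(history: List[Turn], idle_seconds: float, keep: int = 1) -> List[Turn]:
    """Return the turns to send for the next request.

    Recently active session (cache hit expected): send history unchanged.
    Idle session (cache miss expected): strip thinking from all but the
    last `keep` assistant turns that carry thinking.
    """
    if idle_seconds <= IDLE_LIMIT_SECONDS:
        return list(history)

    # Indices of assistant turns that actually have thinking attached.
    with_thinking = [
        i for i, t in enumerate(history)
        if t.role == "assistant" and t.thinking is not None
    ]
    keep_idx = set(with_thinking[-keep:]) if keep > 0 else set()

    # Copy pruned turns instead of mutating the caller's history.
    return [
        t if (t.thinking is None or i in keep_idx) else Turn(t.role, t.text, None)
        for i, t in enumerate(history)
    ]
```

The point of the sketch is that the decision is keyed on cache state (idle time), but what changes is the context: the visible message text is identical in both branches, only the prior thinking differs.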

OAI is now implementing encrypted CoT that you can store and pass back between turns (harness call), so new models have it https://developers.openai.com/api/docs/guides/reasoning#encr...

  • You could also use the responses api which stores all message contents (including reasoning) on OAI servers. This has been possible for quite a while now. Encryption is only necessary if you really care about local storage (which is different from privacy concerns, because the data gets sent to their servers anyway).

    • well the encryption part is also mostly about OAI wanting to keep others from distilling its CoT/reasoning traces, since these are never displayed to devs or end users and, as you say, live on their servers.

      but yes you're correct on the responses api already baking it in too

      supposedly keeping these between tool calls should help the model reason and have better overall outputs etc

> all major vendors throw out the reasoning tokens between turns

That would be surprising to me. The reasoning _is_ the model intelligence in a lot of respects, and so dropping those from the context would affect its output pretty significantly.

I assume that instead they just have a lot of guardrails in place and multiple runtime environments that individual turns ping-pong between in order to dehydrate/rehydrate the reasoning to keep it hidden from the end user.

  • Anthropic very explicitly says below their diagrams ( https://platform.claude.com/docs/en/build-with-claude/contex... ) on this:

    "Stripping extended thinking: Extended thinking blocks (shown in dark gray) are generated during each turn's output phase, but are not carried forward as input tokens for subsequent turns. You do not need to strip the thinking blocks yourself. The Claude API automatically does this for you if you pass them back."

    It's more nuanced in the various modes, but I haven't seen it boil down to thinking tokens surviving more than two turns.

    • https://platform.claude.com/docs/en/build-with-claude/extend...

      default depends on the model class. Opus: Claude Opus 4.5 and later Opus models keep all prior thinking blocks; Claude Opus 4.1 (deprecated) and earlier Opus models keep only the last assistant turn's thinking. Sonnet: Claude Sonnet 4.6 and later Sonnet models keep all; Claude Sonnet 4.5 and earlier Sonnet models keep only the last turn. Haiku: all Haiku models through Claude Haiku 4.5 keep only the last turn. Claude Mythos Preview also keeps all prior thinking blocks.
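The defaults listed above can be written as a small lookup. This is only a restatement of the comment's summary, not an official API: the function name and the `(major, minor)` version tuple are made up for illustration, and Haiku versions after 4.5 are left undefined because the summary does not cover them.

```python
from typing import Tuple


def keeps_all_prior_thinking(family: str, version: Tuple[int, int] = (0, 0)) -> bool:
    """True if the model keeps all prior thinking blocks by default,
    False if it keeps only the last assistant turn's thinking.

    Encodes only the defaults summarized in the comment above.
    """
    family = family.lower()
    if family == "opus":
        return version >= (4, 5)   # Opus 4.5 and later keep all
    if family == "sonnet":
        return version >= (4, 6)   # Sonnet 4.6 and later keep all
    if family == "haiku":
        if version <= (4, 5):
            return False           # through Haiku 4.5: last turn only
        raise ValueError("Haiku versions after 4.5 are not covered by the summary")
    if family == "mythos":
        return True                # Claude Mythos Preview keeps all
    raise ValueError(f"unknown model family: {family}")
```

For example, `keeps_all_prior_thinking("sonnet", (4, 5))` is `False` while `keeps_all_prior_thinking("sonnet", (4, 6))` is `True`, which is the version split described in the comment.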


    • That's really surprising, I stand corrected. I have had a lot of issues with hallucinations I attributed to adaptive thinking, but I wonder if those were actually due to this behavior instead.

      I also wonder if they actually do a hybrid of "standard reasoning" and then classify this stripped chain of thought as "extended thinking".

Gemini models return a thinking signature that you, I think, must send back when invoking further, so they seem to keep them?