
Comment by EnPissant

1 day ago

I don't think this is true.

I'm pretty sure that Codex uses reasoning.encrypted_content=true and store=false with the responses API.

reasoning.encrypted_content=true - The server will return all the reasoning tokens in an encrypted blob that you can pass along in the next call. Only OpenAI can decrypt them.

store=false - Nothing about the conversation is persisted on the server, so every subsequent call must provide the full context.

Combined, these two options turn the Responses API into a stateless one. Without them, it will still persist reasoning tokens within an agentic loop, but it does so statefully, without the client passing the reasoning along each time.
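
For reference, this is roughly what that stateless pattern looks like with the Python SDK. This is a minimal sketch going from memory of the cookbook example; the model name and prompts are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Turn 1: ask for the encrypted reasoning blobs and tell the server not to
# persist anything.
context = [{"role": "user", "content": "Find the bug in util.py"}]
resp = client.responses.create(
    model="gpt-5",                             # placeholder model name
    input=context,
    store=False,
    include=["reasoning.encrypted_content"],
)

# resp.output contains reasoning items whose encrypted_content only OpenAI
# can decrypt. Because nothing is stored server-side, the client has to send
# them back (along with everything else) on the next call.
context += resp.output
context.append({"role": "user", "content": "Now write a patch"})

resp2 = client.responses.create(
    model="gpt-5",
    input=context,
    store=False,
    include=["reasoning.encrypted_content"],
)
```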

Maybe it's changed, but this is certainly how it was back in November.

I would see my available context jump after each user turn (e.g. from 70% to 85% remaining).

Built a tool to analyze the requests, and sure enough the reasoning tokens were removed from past responses (but only between user turns). Here are the two relevant PRs [0][1].

While I was trying to get to the bottom of it, someone from OpenAI reached out and said this was expected and a limitation of the Responses API (interesting side note: Codex uses the Responses API, but passes the full context with every request).

This is the relevant part of the docs[2]:

> In turn 2, any reasoning items from turn 1 are ignored and removed, since the model does not reuse reasoning items from previous turns.

[0]https://github.com/openai/codex/pull/5857

[1]https://github.com/openai/codex/pull/5986

[2]https://cookbook.openai.com/examples/responses_api/reasoning...

  • Thanks. That's really interesting. That documentation certainly does say that reasoning from previous turns is dropped (a "turn" being an agentic loop between user messages), even if you include the encrypted content for it in the API calls.

    I wonder why the second PR you linked was made, then. Maybe the documentation is outdated? Or maybe it's just to let the server stay in complete control of what gets dropped and when, as it is when you use Responses statefully? That could be because the behavior has changed, or because they may want to change it in the future. Also, Codex uses a different endpoint than the public API, so maybe there are other differences?

    Also, this would mean that the tail of the KV cache containing each turn has to be thrown away when the next turn starts. But I guess that's not a very big deal, since it only happens once per turn.
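
    To make that concrete, here's a rough sketch (my own illustration, not anything from Codex or the API, and assuming dict-shaped items for simplicity) of what dropping prior-turn reasoning implies: the next prompt is the old context with those items filtered out, so any KV cache entries computed after the first removed item no longer match and have to be recomputed.

    ```python
    def next_user_turn_prompt(prev_context, new_user_msg):
        # Per the cookbook, reasoning items from earlier user turns are ignored,
        # so the effective prompt is the old context minus those items...
        kept = [item for item in prev_context if item.get("type") != "reasoning"]
        # ...plus the new user message. For prefix/KV caching, everything after
        # the first removed reasoning item is now a cache miss.
        return kept + [{"role": "user", "content": new_user_msg}]
    ```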

    EDIT:

    This contradicts the caching documentation: https://developers.openai.com/blog/responses-api/

    Specifically:

    > And here’s where reasoning models really shine: Responses preserves the model’s reasoning state across those turns. In Chat Completions, reasoning is dropped between calls, like the detective forgetting the clues every time they leave the room. Responses keeps the notebook open; step‑by‑step thought processes actually survive into the next turn. That shows up in benchmarks (TAUBench +5%) and in more efficient cache utilization and latency.

    • I think the delta may be an overloaded use of "turn"? The Responses API does preserve reasoning across multiple "agent turns", but doesn't appear to do so across multiple "user turns" (as of November, at least).

      In either case, the lack of clarity around the Responses API's inner workings isn't great. As a developer, I send all the encrypted reasoning items with the Responses API and expect them to still matter, not to be silently discarded[0]:

      > you can choose to include reasoning 1 + 2 + 3 in this request for ease, but we will ignore them and these tokens will not be sent to the model.

      [0]https://raw.githubusercontent.com/openai/openai-cookbook/mai...
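
      To illustrate the distinction (my own sketch, not Codex's actual loop; it assumes the `client`, `context`, and `tools` from the earlier example, and `run_tool()` is a hypothetical dispatcher): within one user turn, the "agent turns" are the tool-call iterations, and that's where passing the reasoning items back appears to matter.

      ```python
      resp = client.responses.create(
          model="gpt-5",                             # placeholder model name
          input=context,
          tools=tools,
          store=False,
          include=["reasoning.encrypted_content"],
      )

      # "Agent turns": keep looping while the model asks for tools, feeding its
      # own reasoning + function_call items back in with the tool results.
      while any(item.type == "function_call" for item in resp.output):
          context += resp.output
          for item in resp.output:
              if item.type == "function_call":
                  context.append({
                      "type": "function_call_output",
                      "call_id": item.call_id,
                      "output": run_tool(item),      # hypothetical dispatcher
                  })
          resp = client.responses.create(
              model="gpt-5",
              input=context,
              tools=tools,
              store=False,
              include=["reasoning.encrypted_content"],
          )

      # "User turn": per the cookbook quote above, once the user sends their
      # next message, the reasoning items accumulated in this loop are ignored,
      # even if you resend them.
      ```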
