
Comment by jumploops

21 hours ago

One thing that surprised me when diving into the Codex internals was that the reasoning tokens persist during the agent tool call loop, but are discarded after every user turn.

This helps preserve context over many turns, but it can also mean some context is lost between two related user turns.

A strategy that's helped me here is having the model write progress updates (along with general plans/specs/debug notes/etc.) to markdown files, acting as a sort of "snapshot" that works across many context windows.
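For example, a snapshot might look something like this (the filename and headings are just an illustration of the idea, not a fixed convention):

```
# notes/progress-auth-refactor.md

## Goal
Move session handling out of the request middleware.

## Done so far
- Extracted a SessionStore interface; in-memory implementation passing tests.

## Next / open questions
- Wire up the Redis-backed implementation; unsure whether to keep the old cookie format.
```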

It depends on the API path. Chat Completions does what you describe, but isn't it legacy?

I've only used Codex with the Responses v1 API, and there it's the complete opposite: already-generated reasoning tokens persist even when you cancel turns before they've finished the thought process and then send another message (without rolling back).

Also, with Responses v1, xhigh mode eats through the context window multiple times faster than the other modes, which checks out with this.

I think it might be a good decision though, as it might keep the context aligned with what the user sees.

If the reasoning tokens were persisted, I imagine it would be possible to build up more and more context that's invisible to the user, and in the worst case the model's and the user's "understanding" of the chat might diverge.

E.g. imagine a chat where the user just wants to make some small changes. The model asks whether it should also add test cases. The user declines and tells the model not to ask about it again.

The user asks for some more changes. However, invisibly to the user, the model keeps "thinking" about test cases, never mentioning it outside of its reasoning blocks.

So suddenly, from the model's perspective, a lot of the context is about test cases, while from the user's POV, it was only one irrelevant question at the beginning.

This is effective, and it's convenient to have all that stuff co-located with the code, but I've found it causes problems in team environments, or really anywhere you want to be able to work on multiple branches concurrently. I haven't come up with a good answer yet, but I think my next experiment is to offload that stuff to a daemon with external storage, and then have a CLI client that the agent (or a human) can drive to talk to it.
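Roughly the shape of what I have in mind (only a sketch: the storage path, table, and commands are made up, and the "daemon" is collapsed into a local SQLite file for brevity):

```python
# Hypothetical CLI the agent (or a human) could call to read/write per-branch
# notes that live outside the repo, so nothing agent-specific gets checked in.
import argparse
import sqlite3
import subprocess
from pathlib import Path

DB = Path.home() / ".agent-notes" / "notes.db"  # assumed external store

def current_branch() -> str:
    # Key notes by branch so concurrent worktrees/branches don't collide.
    out = subprocess.run(["git", "rev-parse", "--abbrev-ref", "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def main() -> None:
    DB.parent.mkdir(exist_ok=True)
    con = sqlite3.connect(str(DB))
    con.execute("CREATE TABLE IF NOT EXISTS notes "
                "(branch TEXT, body TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)")

    parser = argparse.ArgumentParser(prog="agent-notes")
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("read", help="print notes for the current branch")
    write = sub.add_parser("write", help="append a note for the current branch")
    write.add_argument("body")
    args = parser.parse_args()

    if args.cmd == "write":
        con.execute("INSERT INTO notes (branch, body) VALUES (?, ?)",
                    (current_branch(), args.body))
        con.commit()
    else:
        for (body,) in con.execute(
                "SELECT body FROM notes WHERE branch = ? ORDER BY ts",
                (current_branch(),)):
            print(body)

if __name__ == "__main__":
    main()
```

The agent would just shell out to `agent-notes write "..."` / `agent-notes read` instead of touching files in the working tree.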

  • git worktrees are the canonical solution

    • worktrees are good but they solve a different problem. The question is: if you have a lot of agent config specific to your work on a project, where do you put it? I'm coming around to the idea that checking it in causes enough problems that it's worth the pain to put it somewhere else.

      1 reply →

    • worktrees are a bunch of extra effort. if your code's well segregated, and you have the right config, you can run multiple agents in the same copy of the repo at the same time, so long as they're working on sufficiently different tasks.

      2 replies →

I’ve been using agent-shell in emacs a lot and it stores transcripts of the entire interaction. It’s helped me out a lot of times because I can say ‘look at the last transcript here’.

It’s not the agent’s responsibility to write this transcript; emacs does it, so I don’t have to worry about the agent forgetting to log something. It’s just writing the buffer to disk.

I think this explains why I'm not getting the most out of Codex: I like to interrupt and respond to things I see in the reasoning tokens.

  • that's the main gripe I have with Codex: I want better observability into what the AI is doing so I can stop it if I see it going down the wrong path. In CC I can see it easily and stop and steer the model. In Codex, the model spends 20 minutes only to do something I didn't agree to. It burns OpenAI tokens too; they could save money by supporting this feature!

Sonnet has the same behavior: it drops thinking on a new user message. Curiously, in the latest Opus they have removed this behavior and all thinking tokens are preserved.

Same here! I think it would be good if the tooling did this by default. I've seen others using SQL for the same purpose, and even a proposal for a succinct, compact way of representing this handoff data.

I don't think this is true.

I'm pretty sure that Codex uses reasoning.encrypted_content=true and store=false with the responses API.

reasoning.encrypted_content=true - The server will return all the reasoning tokens in an encrypted blob you can pass along in the next call. Only OpenAI can decrypt them.

store=false - The server will not persist anything about the conversation. Any subsequent calls must provide all context.

Combined, the two options above turn the Responses API into a stateless one. Without them it will still persist reasoning tokens in an agentic loop, but it does so statefully, without the client passing the reasoning along each time.
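Roughly what that looks like with the Python SDK (just a sketch of the stateless pattern, not Codex's actual code; the model name and messages are placeholders):

```python
from openai import OpenAI

client = OpenAI()

history = [{"role": "user", "content": "Refactor util.py to drop the global cache."}]

# store=False: the server keeps nothing, so the client must resend the full
# context on every call. include=["reasoning.encrypted_content"]: reasoning
# items come back as encrypted blobs that only OpenAI can decrypt, but that
# the client can replay in later requests.
resp = client.responses.create(
    model="gpt-5",  # placeholder
    input=history,
    store=False,
    include=["reasoning.encrypted_content"],
)

# Carry everything the model produced (reasoning items, tool calls, messages)
# into the next request of the agentic loop.
history += [item.model_dump(exclude_none=True) for item in resp.output]
history.append({"role": "user", "content": "Now add a unit test for it."})
```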

  • Maybe it's changed, but this is certainly how it was back in November.

    I would see my available context window jump after each user turn (e.g. from 70% to 85% remaining).

    Built a tool to analyze the requests, and sure enough the reasoning tokens were removed from past responses (but only between user turns). Here are the two relevant PRs [0][1].

    While I was trying to get to the bottom of it, someone from OAI reached out and said this was expected and a limitation of the Responses API (interesting side note: Codex uses the Responses API, but passes the full context with every request).

    This is the relevant part of the docs[2]:

    > In turn 2, any reasoning items from turn 1 are ignored and removed, since the model does not reuse reasoning items from previous turns.

    [0]https://github.com/openai/codex/pull/5857

    [1]https://github.com/openai/codex/pull/5986

    [2]https://cookbook.openai.com/examples/responses_api/reasoning...

    • Thanks. That's really interesting. That documentation certainly does say that reasoning from previous turns is dropped (a turn being an agentic loop between user messages), even if you include the encrypted content for it in the API calls.

      I wonder why the second PR you linked was made, then. Maybe the documentation is outdated? Or maybe it's just to let the server be in complete control of what gets dropped and when, like it is when you use Responses statefully? That could be because the behavior has changed, or because they may want to change it in the future. Also, Codex uses a different endpoint than the API, so maybe there are some other differences?

      Also, this would mean that the tail of the KV cache that contains each new turn must be thrown away when the next turn starts. But I guess that's not a very big deal, as it only happens once for each turn.

      EDIT:

      This contradicts the caching documentation: https://developers.openai.com/blog/responses-api/

      Specifically:

      > And here’s where reasoning models really shine: Responses preserves the model’s reasoning state across those turns. In Chat Completions, reasoning is dropped between calls, like the detective forgetting the clues every time they leave the room. Responses keeps the notebook open; step‑by‑step thought processes actually survive into the next turn. That shows up in benchmarks (TAUBench +5%) and in more efficient cache utilization and latency.

      2 replies →

That could explain the "churn" when it gets stuck. Do you think it needs to maintain an internal state over time to keep track of longer threads, or are written notes enough to bridge the gap?

But that's why I like Codex CLI: it's so bare-bones and lightweight that I can build lots of tools on top of it. Persistent thinking tokens? Let me have that using a separate file the AI writes to. The reasoning tokens we see aren't the actual tokens anyway; the model does a lot more behind the scenes, but the API keeps them hidden (all providers do that).

  • Codex is wicked efficient with context windows, with the tradeoff of time spent. It hurts the flow state, but overall I've found that it's the best at having long conversations/coding sessions.

Where do you save the progress updates? And do you delete them afterwards, or do you end up with 100+ progress updates each time you have Claude or Codex implement a feature or change?