Are these reasoning blobs the reason ChatGPT always requests to “store data in persistent storage”?
Very interesting. The state management is the really insightful find here.
I always wondered how these large AI companies managed access for millions of simultaneous users without having to allocate a dedicated LLM instance for each user. Pushing the complete state down to the user after every call makes perfect sense. The LLM itself stays memoryless and ready to respond to an arbitrary prompt. Very nice.
N.B. This is exactly how Seaside, VBA, and even Arc[1] do server-side state generally: by encrypting the blob that represents the state and sending it to the client, which sends it back on future requests (where it is decrypted and rehydrated).
It's an old trick that everyone designing protocols should know, since there are lots of applications beyond AI companies.
[1]: As in, pg's lisp: https://arclanguage.github.io/ref/srv.html#:~:text=The%20pre...
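For anyone who hasn't seen the trick, here is a minimal sketch of the round trip in Python. Big caveat: it signs the blob with an HMAC instead of encrypting it, because the standard library has no authenticated encryption, so the client can still read the state here; a real deployment would presumably use authenticated encryption. The names `SECRET_KEY`, `seal`, `unseal`, and `handle_request` are made up for illustration and are not any provider's or framework's API.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional, Tuple

# Hypothetical server-side secret; never sent to the client.
SECRET_KEY = b"demo-key-not-for-production"
TAG_LEN = 32  # SHA-256 digest size in bytes


def seal(state: dict) -> str:
    """Serialize the state and prepend an HMAC tag so tampering is detectable."""
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode()


def unseal(blob: str) -> dict:
    """Verify the tag and rehydrate the state; raise ValueError if it was altered."""
    raw = base64.urlsafe_b64decode(blob.encode())
    tag, payload = raw[:TAG_LEN], raw[TAG_LEN:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("state blob failed verification")
    return json.loads(payload)


def handle_request(blob: Optional[str], message: str) -> Tuple[str, str]:
    """Stateless handler: everything it knows arrives inside the request."""
    state = unseal(blob) if blob else {"turns": []}
    state["turns"].append(message)
    reply = f"seen {len(state['turns'])} message(s)"
    return reply, seal(state)


if __name__ == "__main__":
    reply, blob = handle_request(None, "hi")
    reply, blob = handle_request(blob, "again")
    print(reply)
```

The server keeps nothing between calls except the key, so any instance can answer any request. Note that this sketch does nothing about replay: an old blob still verifies.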
And don't forget the venerable ASP.NET Web Forms with its kilobytes of __VIEWSTATE
Do they mitigate replay attacks?
While it seems like a good idea, resending a growing context window is very inefficient and costly. Instance pinning would yield huge efficiency gains but also collapse LLM provider revenue. This is something open models could solve better.
even a max size context window is what, ~1M tokens? iirc tokens are generally drawn from a vocab of size ~300k. Assume no compression before the encryption (no clue if this is true, but compressing text before encryption can leak info about the message, namely how compressible it is): that's log2(300k) ≈ 18 bits per token, or a bit over 2 bytes. So each "turn" would involve ~2 MB extra in each direction. And again, this is assuming max context.
seems plausibly fine
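Quick sanity check of that estimate, as a rough sketch. The vocab size and context length are the ballpark figures quoted above (not measured values), and the variable names are just for illustration.

```python
import math

vocab_size = 300_000        # ballpark vocab size from the comment above ("iirc")
context_tokens = 1_000_000  # assumed max context window

# Information-theoretic lower bound with no compression: log2(vocab) bits per token.
bits_per_token = math.log2(vocab_size)
bytes_per_turn = context_tokens * bits_per_token / 8

# A fixed-width encoding has to round up to a whole number of bits.
fixed_bits = math.ceil(bits_per_token)
fixed_bytes_per_turn = context_tokens * fixed_bits / 8

print(f"{bits_per_token:.2f} bits/token, {bytes_per_turn / 1e6:.2f} MB per direction")
print(f"{fixed_bits} bits/token fixed width, {fixed_bytes_per_turn / 1e6:.2f} MB per direction")
```

Either way it lands a little above 2 MB per direction at full context, so the "~2 MB" figure holds up as an order of magnitude.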
Can you elaborate? How could it be more efficient and bad for revenue? Would it also be bad for profit?
the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:
>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.
262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.
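To put a number on that "exchange rate", here is the arithmetic from the quoted figures. Assumptions on my part: "262K" means 2**18 tokens, GB means 10**9 bytes, and the plaintext is one byte per character.

```python
kv_cache_bytes = 56e9          # "56 GB of KV cache" from the quote
context_tokens = 2 ** 18       # "262K" read as 262,144 tokens (assumption)
chars_per_token = 5            # rough average used above

# Plaintext size, assuming one byte per character.
plaintext_bytes = context_tokens * chars_per_token

# How much cache each token costs, and how that compares to its text.
kv_bytes_per_token = kv_cache_bytes / context_tokens
expansion = kv_cache_bytes / plaintext_bytes

print(f"plaintext: {plaintext_bytes / 1e6:.2f} MB")
print(f"KV cache per token: {kv_bytes_per_token / 1e3:.0f} KB")
print(f"expansion factor: {expansion:,.0f}x")
```

So for that particular setup each token costs a couple hundred KB of cache, and the cache is tens of thousands of times larger than the text it represents. Other models and cache layouts will give different numbers.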
Except the providers also cache the parsing of the prompt (the KV cache), and that has substantial cost savings (easily an 80% saving on typical coding use cases).
That caching is done server side and not passed to the client, which in turn means they still need state management on the server side, although it perhaps doesn't need the same level of global replication and availability.
from the March changes, it looked like they increased cache eviction rates on the VRAM at Claude, causing everyone to start burning tokens as they had to regen token state.
they still have to cache the tokens. it's not completely stateless.
in theory, every conversation is replayed from the beginning. in practice, it's only going to be economical to heavily cache the stable portions of the text as tokens inside the GPU
one of the reasons the cloud providers have such heavy prompts is that they can be cached for all users, but it's essentially poisoning the state before you even start. a lot of the variability appears related to changing the context rather than the model.
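A toy sketch of what caching the stable prefix could look like. This is not how any provider actually implements it: the `PrefixCache` class, the block size, and the hashing scheme are all invented for illustration, and it only tracks which prefixes have been seen rather than storing real KV tensors.

```python
import hashlib
from typing import List, Sequence, Set, Tuple


def prefix_key(tokens: Sequence[int]) -> str:
    """Hash a token prefix so it can be used as a cache key."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()


class PrefixCache:
    """Toy stand-in for a server-side KV cache keyed by token prefix."""

    def __init__(self, block: int = 4) -> None:
        self.block = block          # prefixes are cached at block boundaries only
        self.seen: Set[str] = set()

    def longest_hit(self, tokens: Sequence[int]) -> int:
        """Length of the longest cached prefix of `tokens` (0 if none)."""
        n = (len(tokens) // self.block) * self.block
        while n > 0:
            if prefix_key(tokens[:n]) in self.seen:
                return n
            n -= self.block
        return 0

    def process(self, tokens: Sequence[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that must be recomputed)."""
        hit = self.longest_hit(tokens)
        for end in range(self.block, len(tokens) + 1, self.block):
            self.seen.add(prefix_key(tokens[:end]))
        return hit, len(tokens) - hit


if __name__ == "__main__":
    system_prompt: List[int] = list(range(8))   # shared by every user
    cache = PrefixCache(block=4)
    print(cache.process(system_prompt + [100, 101, 102, 103]))  # first user: cold
    print(cache.process(system_prompt + [200, 201]))            # second user: prefix hit
```

The second request only recomputes its own tokens because the shared prompt is already cached, which is the economic argument for a long, stable system prompt. If an entry gets evicted, the whole prefix has to be recomputed from scratch.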
models are expensive and the bean counters know fine-tuning and context changes are cheaper. I'd guess the IPOs are essentially the SOTA EOL.
One possible use for the "replay across accounts": if you can get a reasoning block that jailbreaks the model, you could share that block without sharing how you did it, and others can immediately take advantage of it too.
Not necessarily for the "without sharing" part, but to increase the reliability of the jailbreak. The same prompt isn't guaranteed to return the same result, but combining the internal thinking with the prompt might be a more effective approach.
Why do reasoning blocks even get encrypted? Reasoning can’t contain information that is more ‘sensitive’ than the assistant response. It is annoying not to be able to see reasoning tokens.
They are encrypted to prevent others from training on the reasoning trace. But before Anthropic started encrypting the reasoning traces, they were signed. The signature was to prevent the user from manipulating the reasoning part of the context, because the model treats this part as "more trustworthy". So it would be bad if the user could manipulate the reasoning to convince the model to do something dangerous or against policy. Encryption keeps this property while also preventing the reasoning from being used for training.
Sure they can. I was able to figure out Gemini 2.5 Pro's "Memory" feature's hidden system prompt because the reasoning tokens reference the markdown headers by name as "blah blah says I can't refer to this", while the output would never mention them.
Yeah, I get that you can jailbreak and get that info anyway. Also that this is specific to front ends like web chat and less about API usage. But as a sibling points out, it's also a good way to make post-training other models harder. Mostly a "win/win" for the provider.
It's to prevent you training another model to emulate the reasoning. They want it to be their moat.
Very cool idea to use thinking duration (either in tokens or in wall time) as a side-channel!
Awesome write-up. Seems like a great way to play with model responses now that prefill is gone.
Also commenting to say I really enjoyed it.
Super cool side channel attack. I tend to agree that it's pretty impractical, but it's such a fun discovery!