Comment by solomatov

5 hours ago

> that he doesn't really understand how inference works from a technical perspective

Could you share what tells about it? I.e. where he was wrong about it?

1 comment

solomatov

There's examples both in his writing and also in his appearances on podcasts, interviews, etc.

I'll cherry pick a couple:

“When these new models ‘reason,’ they break a user’s input and break into component parts, then run inference on each one of those parts.” [1]

This is not at all how test-time compute works. At best, this is a very loose metaphor that he may have used out of convenience. This might sound a bit pedantic to point out, but this is a very basic thing that he's getting wrong (presumably at least, again it could be that he just used a poor metaphor).

A less pedantic example would be his claims related to gpt-5/chatgpt auto-routing. He argued that having a router means OpenAI can no longer cache static prompts, because the user prompt has to come before the hidden instructions [2]. This is just not at all how this works at inference-time. There is no evidence that the standard approach of system>developer>user instruction hierarchy has changed, the public API and caching docs maintain this.

But even more broadly, it suggests he is reasoning about kv/prefix caching at the wrong level of abstraction. It's true that conventional prefix caching does require a stable prefix, so yes, if you literally put variable user content before the static prompt, you would destroy the cacheability of that static prompt.

But that is exactly why inference systems are designed to preserve reusable prefixes where possible (via checkpointing or similar), and why serving systems care so much about prefix caching. This is also a big part of how disaggregated prefill/decode infra works where cache-aware routing is critical. His argument treats a bad prompt layout as if it were a necessary consequence of routing, rather than an avoidable implementation choice.

A router can read the user request, decide which model path to use, and then construct a normal downstream model call with stable static instructions first and user content later. Treating that as impossible implies a fundamental architectural misunderstanding.

[1] https://www.wheresyoured.at/how-to-argue-with-an-ai-booster/

[2] https://www.wheresyoured.at/how-does-gpt-5-work/