Comment by TeMPOraL
8 hours ago
That's why the harness is moving server-side: because generating tokens is not the actual point of the model, not for the users. Especially with tool calling giving us agents that can act, most of the tokens generated are not, themselves, critical to the end users. Specifically, a lot of tokens goes into orchestrating actual tool calls, and then most "thinking tokens" are only relevant to users only in so far as they help users keep track of and verify what the LLM is doing. So all those tokens can be hidden or replaced by partial summaries, and all of that can happen server-side, and then there's very little to distill from.
I haven't heard of this happening, do you have links any explainers on this?