Comment by genewitch
19 hours ago
I'm curious if anyone has logged the number of thinking tokens over time. My implication was that the "thinking/reasoning" modes are a way for LLM providers to put their thumb on the scale for how much the service costs.
They get to see (if you haven't opted out) your context, ideas, source code, etc., and in return you give them $220 and they give you back "out of tokens".
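If anyone wants to try logging it, here's a rough sketch using the Anthropic Python SDK. The model id, thinking budget, and log format are placeholders, and the thinking share is only approximated from block lengths, since usage.output_tokens lumps thinking and visible tokens together:

```python
# Rough sketch: log how much of each response is "thinking" vs. visible text.
# Model id, budget, and the CSV layout are placeholders, not a recommendation.
import csv, time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_and_log(prompt: str) -> None:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=8000,                   # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 4000},
        messages=[{"role": "user", "content": prompt}],
    )
    # approximate split: character counts of thinking blocks vs. text blocks
    thinking_chars = sum(len(b.thinking) for b in resp.content if b.type == "thinking")
    text_chars = sum(len(b.text) for b in resp.content if b.type == "text")
    with open("token_log.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            time.time(),
            resp.usage.input_tokens,
            resp.usage.output_tokens,  # includes thinking tokens
            thinking_chars,
            text_chars,
        ])
```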
> My implication was that the "thinking/reasoning" modes are a way for LLM providers to put their thumb on the scale for how much the service costs.
It's also a way to improve performance on the things their customers care about. I'm not paying Anthropic more than I do for car insurance every month because I want to pinch ~~pennies~~ tokens; I do it because I can finally offload a ton of tedious work onto Opus 4.5 without hand-holding it and reviewing every line.
The subscription is already such a great value over paying by the token that they've got plenty of room to find the right balance.
> My implication was that the "thinking/reasoning" modes are a way for LLM providers to put their thumb on the scale for how much the service costs.
I've done RL training on small local models, and there's a strong correlation between length of response and accuracy. The more they churn tokens, the better the end result gets.
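If you want to check this on your own runs, the measurement itself is trivial; here's a minimal sketch (the rollout data below is made up, and `rollouts` stands in for whatever your eval harness logs):

```python
# Quick check of the length-vs-accuracy relationship on logged rollouts.
# Each entry is (completion_token_count, correct); the stat is just Pearson's r,
# which with a binary outcome is the point-biserial correlation.
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rollouts = [(310, 1), (95, 0), (540, 1), (120, 0), (800, 1)]  # made-up examples
lengths = [n for n, _ in rollouts]
correct = [c for _, c in rollouts]
print(f"r = {pearson_r(lengths, correct):.2f}")
```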
I actually think that the hyper-scalers would prefer to serve shorter answers. A token generated at 1k context is cheaper to serve than one at 10k context, and way, way cheaper than one at 100k context.
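Rough numbers to make the point, ignoring batching, paged attention, and every other serving trick: the KV-cache traffic per decoded token alone scales linearly with context length. All the model dimensions below are made-up placeholders, not any real model's config:

```python
# Back-of-envelope: bytes of KV cache touched per decoded token vs. context length.
# layers / kv_heads / head_dim / fp16 are illustrative placeholders.
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2

def kv_bytes_read_per_token(ctx_len: int) -> float:
    # each decoded token attends over ctx_len cached keys and values in every layer
    return ctx_len * layers * kv_heads * head_dim * 2 * bytes_per_val

for ctx in (1_000, 10_000, 100_000):
    gb = kv_bytes_read_per_token(ctx) / 1e9
    print(f"{ctx:>7} ctx: ~{gb:.2f} GB of KV cache touched per generated token")
```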
> there's a strong correlation between length of response and accuracy
I'd need to see real numbers. I can trigger a thinking model to generate hundreds of tokens and return a three-word response (however many tokens that is), or switch to a non-thinking model of the same family that just gives the same result. I don't necessarily doubt your experience; I just haven't had that experience tuning SD, for example, which is also transformer-based.
I'm sure there's some mathematical reason why longer context = more accuracy, but is that intrinsic to transformer-based LLMs? That is, given your point that the hyper-scalers want shorter responses, do you think they're expending more effort to get shorter responses of equivalent accuracy, or are they trying to find some other architecture to overcome the "limitations" of the current one?
I believe Claude Code recently turned on max reasoning for all requests. Previously you'd have to set it manually or use the word "ultrathink".