Comment by motbus3
12 hours ago
I have no details, but a little bird told me of a project that was estimated to use several million tokens per day to automate the work of a team that got laid off. The operation is now a mess: no one is willing to be held liable, and since the cheap model they used is about to be retired, the company is going to see at least a 4x increase in price.
I have the feeling that the age of "I can't be blamed for AI stuff" will, for a while, turn into "this was the computer guy's mistake".
PS: I've been using Claude Opus 4.8 and it is worse than 4.6; I'd even say Sonnet 4.6 is better. PhD level of software and engineering, I believe! I know many PhDs who never coded or worked anyway.
Glad I'm not the only one. Almost every factual thing with the new Opus is wrong (and it now even happens with 4.6?). I asked it about car stuff yesterday and it totally misrepresented what a car axle even looks like. Today I talked about my CV and it was just plain wrong. I don't know what happened; it wasn't like this a few weeks back, and I'm even considering cancelling Claude altogether. GPT 5.5 for coding is fine and way more stable, but regular work is just broken.
Judging by the gap between the release dates of 4.7 and 4.8, it seems 4.8 was more likely an attempted bugfix.
But 4.8 still underperforms on most tasks. I have things running where 4o-mini does considerably better, repeatably.
They might have tuned it for a particular reason and I would not doubt that the harness has been made worse.
Sometimes I'm tempted to think it does things wrong on purpose.
On the topic of older (Claude) models being better... does anyone know of anything close to 3.5 (or 3.6) era Sonnet? It was by far the best LLM I had ever put my doubts to. It actually explained things in a human way, not like some AI I need to re-read thrice to understand.
(I've used modern Gemini 3.1 Pro and Claude too. Modern ChatGPT is just as useless; I've never heard a human speak in bullet points. The human brain never encounters that IRL.)
This was obviously a conscious choice from the leadership at the frontier labs, and especially OpenAI, considering how 4o turned out.
I don't think they expected the ELIZA effect [0] to explode as much as it did when they started feeding user feedback directly into post-training for the next generation, so to be safe they've likely added several regimens of synthetic data to ensure ChatGPT steers away from ELIZA-like behavior.
[0]: https://en.wikipedia.org/wiki/ELIZA_effect
It is hard to say, because there is an "affection" memory at play: it was better than what we had before, so in hindsight it seems better than it was.
In my humble opinion, for what it's worth, it improved gradually, not exponentially, up to 4.5.
4.6 seems to be a minor step, and the latest two are pure rubbish.
To me this is clearly a skill issue. Several million tokens per day is peanuts, even if uncached. gpt-5.5 is $5 per million input tokens.
Anybody doing this seriously understands how to optimize their workflows for smaller models once they start to lock in processes.
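For a rough sense of scale, here is a minimal sketch of the input-side arithmetic. The $5 per million input tokens is the figure quoted above; the 5 million tokens per day is only an assumed stand-in for "several million", and output tokens are ignored entirely.

```python
def daily_input_cost(tokens_per_day: float, usd_per_million: float) -> float:
    """USD cost of sending `tokens_per_day` input tokens at a flat per-million rate."""
    return tokens_per_day / 1_000_000 * usd_per_million


# Assumption: "several million" is not quantified in the thread; 5M is illustrative.
ASSUMED_TOKENS_PER_DAY = 5_000_000
# Input price quoted in the comment above, in USD per million tokens.
PRICE_PER_MILLION_INPUT = 5.0

cost = daily_input_cost(ASSUMED_TOKENS_PER_DAY, PRICE_PER_MILLION_INPUT)
print(f"${cost:.2f} per day, about ${cost * 30:.2f} per 30 days")
```

Under those assumptions the input bill is on the order of tens of dollars a day; the real number depends on the actual volume and on how much of it is output.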
The expensive tokens are output, not input. A useful rule of thumb is that a million tokens per day means roughly 10 tok/s on a 24/7 basis (1,000,000 / 86,400 s ≈ 11.6).
Even then, I highly doubt any sort of automation is producing on the order of several million tokens daily. The issue I see with the org in the parent comment seems to stem from management and not any sort of token repricing.
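The rule of thumb is easy to check. A minimal sketch, where the 1M, 5M and 10M daily volumes are just illustrative values and not figures from the project being discussed:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tokens_per_second(tokens_per_day: float) -> float:
    """Average sustained rate needed to emit `tokens_per_day` around the clock."""
    return tokens_per_day / SECONDS_PER_DAY


# Illustrative daily volumes, in millions of tokens (assumed, not from the thread).
for millions in (1, 5, 10):
    rate = tokens_per_second(millions * 1_000_000)
    print(f"{millions}M tokens/day ~ {rate:.1f} tok/s sustained")
```

So "several million" output tokens per day would mean a pipeline generating continuously at tens of tokens per second, day and night.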
You're talking without even knowing what this is about. It is easy peasy to spend millions of tokens per minute if you have the content for it.
This is not about you chatting in your ChatGPT window, for sure.
I don't doubt that the operation as a whole is a disaster, but they should be able to avoid the price increase by using one of the many other cheap models like DeepSeek V4 Flash, right?
DeepSeek V4 Flash and Pro are insanely good. They'd be worth it even at the same price.