An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.
Per Anthropic’s RCA, linked in the OP's post about the September 2025 issues:
“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”
So according to Anthropic, they are not tweaking quality settings due to demand.
15 replies →
Personally, I'd rather get queued up with a longer wait time. I mean, not ridiculously long, but I'm OK waiting five minutes to get correct, or at least more correct, responses.
Sure, I'll take a cup of coffee while I wait (:
1 reply →
If you aren't defrauding your customers, you will be left behind in 2026
1 reply →
They don't advertise a certain quality. You take what they have or leave it.
> I think delivering lower quality than what was advertised and benchmarked is borderline fraud
Welcome to Silicon Valley, I guess. Everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.
If there's no way to check, then how can you claim it's fraud? :)
There is no level of quality advertised, as far as I can see.
2 replies →
I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.
It would happen if they quietly decided to serve up more aggressively distilled / quantised / smaller models when under load.
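To make "quantised" concrete, here's a toy sketch (plain NumPy, not any provider's actual serving stack) of symmetric int8 weight quantisation. The rounding error it adds to every weight is exactly the kind of quality loss that's cheap to serve and hard for a user to prove:

```python
# Toy sketch of symmetric int8 weight quantisation. This illustrates the kind
# of knob being discussed, not any provider's actual serving stack.
import numpy as np

rng = np.random.default_rng(42)
w = rng.standard_normal((512, 512)).astype(np.float32)  # pretend weight matrix

scale = np.abs(w).max() / 127.0                          # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale            # what inference would see

err = np.abs(w - w_dequant)
print("mean abs error:", err.mean(), "max abs error:", err.max())
```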
Or just reducing the reasoning tokens.
They advertise the Opus 4.5 model. Secretly substituting a cheaper one to save costs would be fraud.
If you use the API, you pay for a specific model, yes, but even then there are "workarounds", such as the one someone else pointed out: reducing the amount of time they let it "think".
If you use the subscriptions, the terms specifically say that beyond the caps they can limit your "model and feature usage, at our discretion".
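For illustration, this is roughly what the "thinking" knob looks like from the caller's side. It's a minimal sketch using the Anthropic Python SDK's extended-thinking parameter; the model name and budget values are made up for the example, and it says nothing about what Anthropic actually does server-side. The point is just that the same model gets cheaper to serve, and usually weaker, by shrinking the thinking budget:

```python
# Sketch of the "thinking time" knob as it appears on the API side.
# Assumes the Anthropic Python SDK with extended thinking enabled; the model
# name and budget values are illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, thinking_budget: int) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",  # you pay for this specific model...
        max_tokens=16_000,
        thinking={"type": "enabled", "budget_tokens": thinking_budget},
        messages=[{"role": "user", "content": prompt}],
    )
    # With extended thinking the reply contains thinking blocks plus text
    # blocks; return only the visible text.
    return "".join(b.text for b in response.content if b.type == "text")

# Same model, same prompt. Cutting the budget from 8000 to 1024 tokens makes
# the request cheaper to serve and, typically, the answer weaker.
print(ask("Prove that sqrt(2) is irrational.", thinking_budget=8_000))
print(ask("Prove that sqrt(2) is irrational.", thinking_budget=1_024))
```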
1 reply →
Old-school Gemini used to do this. It was super obvious because midday the model would go from stupid to completely brain-dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to Discord):
> How do I know which model Gemini is using in its responses?
> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.
1 reply →
I've seen some issues with garbage tokens during high load (they seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over). I suspect Anthropic has some threading bugs or race conditions in their caching/inference code that only show up under very high load.
From what I understand, this can come from the batching of requests.
So, a known bug?
No. Basically, the requests are processed together in batches, and the order they're listed in matters for the results, because the grid (tiles) that the GPU ultimately processes is different depending on the order they entered in.
So if you want batching + determinism, you need the same batch with the same order, which obviously doesn't work when there are N+1 clients instead of just one.
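A toy illustration of why order matters (plain NumPy, nothing Anthropic-specific): floating-point addition isn't associative, so reducing the same values in a different order, which is roughly what a different batch layout/tiling amounts to, gives slightly different sums, and those tiny differences can compound into a different sampled token:

```python
# Toy demonstration that reduction order changes floating-point results.
# This is an analogy for batched GPU inference, not Anthropic's kernels:
# a different batch composition means a different tiling/reduction order,
# and tiny numeric differences can eventually flip a sampled token.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

sum_in_order = np.float32(0)
for chunk in np.split(x, 1000):                  # one "tiling" of the reduction
    sum_in_order += chunk.sum()

sum_shuffled = np.float32(0)
for chunk in np.split(x[rng.permutation(x.size)], 1000):  # same values, new order
    sum_shuffled += chunk.sum()

print(sum_in_order, sum_shuffled)
print("bitwise identical:", sum_in_order == sum_shuffled)  # usually False
```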
5 replies →