Comment by hhh

4 months ago

The models don’t change.


hhh

On paper. There's huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.

armchairhacker 4 months ago
And there’s an incentive to publish evidence of this to discourage it. Do you have any?
- TeMPOraL 4 months ago
  
  Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
  
- woadwarrior01 4 months ago
  
  There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.
  [1]: https://marginlab.ai/trackers/claude-code/
  
- coldtea 4 months ago
  
  Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different? That is the extraordinary claim in this case...
seunosewa 4 months ago

Or just change the reasoning levels.

fer 4 months ago

They do. I'm currently seeing a degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".

stavros 4 months ago
Make that 2, I told my friends yesterday "Opus got dumb, new model must be coming".
- arcanemachiner 4 months ago
  
  I swear that different sessions will route to different quants. Sometimes it's good, sometimes not.

esskay 4 months ago

Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.

yorwba 4 months ago
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
(See https://news.ycombinator.com/item?id=46810282 for a case where they "detected" a statistically significant deviation, but only because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that it differed significantly from the long-term average. It seems they have fixed this error by now.)
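A rough sketch of the sample-size arithmetic behind the "hundreds of samples" remark. It assumes a two-sided two-proportion z-test under the normal approximation, with equal numbers of runs before and after; the 56% baseline, the 5% significance level, and the 80% power are illustrative assumptions, and the function names are made up for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p measured on n independent runs."""
    return sqrt(p * (1 - p) / n)


def samples_needed(p_base: float, p_new: float,
                   alpha: float = 0.05, power: float = 0.8) -> int:
    """Runs needed per arm to tell p_base from p_new.

    Uses the normal approximation with unpooled variance; alpha and
    power are assumed defaults, not values taken from the thread.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_base - p_new) ** 2)


if __name__ == "__main__":
    # Quadrupling the sample size halves the standard error.
    print(standard_error(0.5, 100), standard_error(0.5, 400))
    # A 10-point drop vs. a 3-point drop from an assumed 56% baseline.
    print(samples_needed(0.56, 0.46))
    print(samples_needed(0.56, 0.53))
```

Under these assumptions a 10-point drop needs a few hundred runs per arm, and a 3-point drop needs a few thousand, which is far more than one person's day-to-day usage provides.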
- nextaccountic 4 months ago
  
  It's hard to trust public, high-profile benchmarks, because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro, so everything that gets released would perform well on this benchmark.
  

scrollop 4 months ago

You sure about that?

https://marginlab.ai/trackers/claude-code/

withinboredom 4 months ago

Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. Recall what Anthropic originally promised a "max 5" user: 50-200 prompts (https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site).
At a 3-point lower pass rate, that's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
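The 1-6 figure can be checked with a quick back-of-the-envelope sketch. It assumes each prompt is an independent attempt and that the tracker's pass rate applies per prompt, which is a simplification; `extra_failures` is a hypothetical helper named for this example.

```python
def extra_failures(prompts: int, baseline: float, current: float) -> float:
    """Expected additional failed prompts when the pass rate drops
    from `baseline` to `current` over `prompts` independent attempts."""
    return prompts * (baseline - current)


if __name__ == "__main__":
    # 50-200 prompts per 5-hour window, 56% baseline vs. 53% today.
    for prompts in (50, 200):
        print(prompts, round(extra_failures(prompts, 0.56, 0.53), 1))
```

That gives about 1.5 extra failures at 50 prompts and 6 at 200, in line with the range quoted above.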

coldtea 4 months ago

Only nominally...

pixel_popping 4 months ago

Oh yes, they do.

girvo 4 months ago

I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.

coldtea 4 months ago

No conspiracy theories. Companies being scumbags, cutting corners, and doctoring benchmarks while denying it. It's been happening forever.