Comment by hhh

15 hours ago

The models don’t change.

On paper. There's huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.

  • And there’s an incentive to publish evidence of this to discourage it. Do you have any?

    • Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.

    • Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different? That is the extraordinary claim in this case...

Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.

  • Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.

    (See https://news.ycombinator.com/item?id=46810282, where they "detected" a statistically significant deviation, but only because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that it was significantly different from the long-term average. It seems like they have fixed this error by now.)
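For anyone who wants to check the "hundreds of samples" figure, here is a rough sketch of the usual sample-size approximation for comparing two pass rates. The function name, the 5% significance level, the 80% power and the example rates (50% vs. 60%, and 53% vs. 56%) are my own assumptions for illustration, not anything a tracker is known to use.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate samples needed per group to tell pass rates p1 and p2 apart.

    Uses the standard normal-approximation formula for a two-sided,
    two-proportion z-test. alpha and power are assumed defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical 10-point gap: 50% vs. 60% pass rate.
    print(samples_per_group(0.50, 0.60))
    # Hypothetical 3-point gap: 53% vs. 56% pass rate.
    print(samples_per_group(0.53, 0.56))
```

Under those assumptions the 10-point gap needs a few hundred samples per group, and a 3-point gap needs several thousand, because the required count grows with the inverse square of the difference.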

    • It's hard to trust public, high profile benchmarks because any change to a specific model (Opus 4.5 in this case) can be rejected if it causes a regression on SWE-Bench-Pro, so everything that gets released would perform well on this benchmark.

They do. I'm currently seeing a degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".

You sure about that?

https://marginlab.ai/trackers/claude-code/

  • Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.

    And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...

    50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
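The arithmetic behind that estimate, as a quick sketch. It assumes (a big simplification) that every prompt is an independent pass/fail task that fails at the benchmark rate; `extra_failures` is just an illustrative helper name.

```python
def extra_failures(prompts, baseline_pct=56, current_pct=53):
    """Expected additional failures when the pass rate drops from baseline to current.

    Percentages are kept as integers so the arithmetic stays exact.
    """
    return (baseline_pct - current_pct) * prompts / 100


if __name__ == "__main__":
    for prompts in (50, 200):
        # 50 prompts -> 1.5 extra failures, 200 prompts -> 6.0 extra failures
        print(prompts, extra_failures(prompts))
```

So a 3-point drop works out to roughly 1.5 to 6 extra failed prompts over 50-200 prompts, if the benchmark rate carried over one-to-one, which it may well not.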

I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.

  • No conspiracy theories. Just companies being scumbags, cutting corners, and doctoring benchmarks while denying it. It's been happening forever.