Comment by conception
1 day ago
I will note that the 2.5 Pro preview… March? was maybe the best model I’ve used yet. The actual release model was… less. I suspect Google found the preview too expensive and optimized it down, but it was interesting to see there was some hidden horsepower there. Google has always been poised to be the AI leader/winner - excited to see if this is fluff, the real deal, or another preview that gets nerfed.
Dunno if you're right, but I'd like to point out that I've been reading comments like these about every model since GPT-3. It's starting to seem more likely to me to be cognitive bias than not.
I generally haven’t noticed things getting worse after a release, but 2.5’s abilities definitely did degrade. Or perhaps they optimized for something else? Beyond that, the usual “things got worse after release!” complaints haven’t matched my experience, except for when Sonnet had a bug for a month and GPT-5’s autorouter broke.
Yeah, I don't know. I didn't mean to sound accusatory. I might very well be wrong.
Sometimes it is just bias, but with 2.5 Pro there were benchmarks showing the degradation (plus they changed the name every time, so it was obviously a different checkpoint or model).
Why would you assume cognitive bias? Any evidence? These things are indeed very expensive to run, and are often run at a loss. Wouldn't quantization or other tuning be just as reasonable an answer as cognitive bias? It's not like we are talking about reptilian aliens running the White House.
I'm just pointing out a personal observation. Completely anecdotal. FWIW, I don't strongly believe this. I've noticed what may be a selection bias in myself too, as recently as yesterday after GPT-5.1 was released. I asked Codex to make a simple change (less than 50 LOC) and it made an unrelated change, an early return statement, breaking a very simple state machine that goes from waiting -> evaluate -> done. However, I have to remind myself how often LLMs make dumb mistakes despite often seeming impressive.
I noticed the degradation when Gemini stopped being a good research tool, and made me want to strangle it on a daily basis.
It's incredibly frustrating to have a model start to hallucinate sources and be incapable of revisiting its behavior.
It couldn't even understand that it was making up nonsensical RFC references.