Comment by goyozi

6 hours ago

These are very good numbers. I still don’t get why they don’t compare against the latest competitor versions in these posts; it’s not like we’re all not going to notice.

I find it forgivable if it's within a minor version bump. (NB: x.5 is now a de facto major-version bump for LLMs, for whatever reason.)

Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.

I think the argument is that they’re trying to suggest they’re about N months from SOTA.

Realistically I assume they hope readers don’t notice the fine details.

The Qwen models are great for open weights, but in my experience every past release has performed worse than its benchmarks suggest. They’re optimizing for benchmark numbers because they know it works.

  • > Realistically I assume they hope readers don’t notice the fine details.

    The pool of people reading such articles while ignoring such details can't be big.

    • I disagree. Most people skim articles rather than read them deeply.

      On Hacker News, I wonder whether most people even open the article at all most of the time.

I think it's part of the expectation setting (with a side of "we did our distillation/eval harness on a specific model").

If they say it's comparable to 4.7, it anchors that in your head as the model to evaluate against.

Honestly, the initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs at the same level as that, I'm totally willing to switch.

  • 4.6 was an awful experience the month I used it right after launch: it didn't ask anything, just made assumptions and went on its merry way. 4.5 and 4.7 don't do that for me, but 4.7 eats my quota for breakfast, so I've been avoiding it because I like to have it for more than an hour a day.

    • I feel like I had the best and worst ~month experience on 4.6. Initially, when it came out, it seemed to ask good questions and genuinely do well on complex tasks. From about mid-March it was absolutely abysmal; it seemed to assume the stupidest answer/angle for everything and make weird mistakes. 4.7 seems decent so far but usage hurts - at some point my company switched me to a standard seat and I used up 80% of my session usage in 1 prompt. I've since gotten my premium seat back, but I think the pro/standard plan + Opus 4.7 is unusable for daily driving.

    • That experience is also likely tied to the Claude harness around the model, which isn't as well tuned right after a model release. They iterate on this, and different models need different words (unfortunately...).