Comment by ninjagoo

1 day ago

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%

> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> OSWorld: 79.6% / 72.7% / 75.0% / —

Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?

And the decision to withhold general release (of a 'preview', no less!) seems, well, odd. And the decision to release a 'preview' version to specific companies? Do you know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LOL.

What are they trying to do? Induce FOMO and stop the subscriber bleed-out stemming from the recent negative headlines about problems using Claude?

> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen

We're not reading the same numbers, I think. Compared to Opus 4.6, it's a big jump in nearly every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU, but they're still beating their own Opus 4.6 results on those two.

This sounds like a much better model than Opus 4.6.

  • > We're not reading the same numbers, I think.

    We must not be.

    That's why I listed out the ones where it is barely competitive, from @babelfish's table, which itself is extracted from pp. 186–187 of the System Card, where it is compared against Opus 4.6, GPT-5.4 and Gemini 3.1 Pro.

    Sure, it may be better than Opus 4.6 on some of those, but it manages only a small increase over GPT-5.4 on the ones I called out.

    • > barely competitive

      It's higher than all the other models except Gemini 3.1 Pro on MMMLU.

      MMMLU is generally thought to be maxed out; it might not be possible to score meaningfully higher than those scores.

      > Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting the maximum attainable score was significantly below 100%[1]

      Other models get close on GPQA Diamond too, and it wouldn't surprise anyone if the maximum attainable score there were around the ~95% the top models are already hitting.
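
      Back of the envelope, assuming an erroneous question can't reliably be scored correct (and, speculatively, that the multilingual set inherits a similar error rate): the ceiling would be roughly 100% − 6.5% ≈ 93.5%, which the MMMLU scores quoted above are already brushing against.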

      [1] https://en.wikipedia.org/wiki/MMLU

    • You are reading the percentages wrong.

      Because 100% is the maximum, you should be looking at error rates instead. On Terminal-Bench, GPT-5.4's 75.1% is a ~25% error rate, while the new model's 82.0% is an 18% error rate: almost a 1.4x reduction.
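
      To make that concrete, here's a quick back-of-the-envelope sketch in plain Python using the Terminal-Bench 2.0 scores quoted at the top of the thread (the dict and variable names are just illustrative, not anything from the System Card):

        # Pass rates in percent, from the table quoted above.
        scores = {
            "Claude Mythos": 82.0,
            "Claude Opus 4.6": 65.4,
            "GPT-5.4": 75.1,
            "Gemini 3.1 Pro": 68.5,
        }

        mythos_err = 100.0 - scores["Claude Mythos"]  # 18.0% error rate

        for model, score in scores.items():
            err = 100.0 - score  # error rate = 100 - pass rate
            print(f"{model}: {err:.1f}% errors, {err / mythos_err:.2f}x Mythos")

        # GPT-5.4 comes out at 24.9% errors, i.e. 1.38x the Mythos error
        # rate -- the "almost 1.4x reduction" above.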

    • "Barely competitive"? The Mythos column is the first column.

      You are the only person on Hacker News with this take; everyone else is saying "this is a massive jump". FWIW, the data you list shows the biggest jump I can remember for Mythos.


Let's be clear: your entire post is pure, unadulterated FUD. You first claim, based on cherry-picked benchmarks, that Mythos is only "barely competitive" with existing models; then suggest they must be training to the test; then call it "odd" that they are withholding general release, despite detailed and forthcoming explanations from Anthropic about why they are doing that; and then wrap it up with the completely unsubstantiated claim that they must be bleeding subscribers and that this must just be an attempt to stop the bleed.