Comment by babelfish

1 day ago

Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

  SWE-bench Verified:        93.9% / 80.8% / —     / 80.6%
  SWE-bench Pro:             77.8% / 53.4% / 57.7% / 54.2%
  SWE-bench Multilingual:    87.3% / 77.8% / —     / —
  SWE-bench Multimodal:      59.0% / 27.1% / —     / —
  Terminal-Bench 2.0:        82.0% / 65.4% / 75.1% / 68.5%

  GPQA Diamond:              94.5% / 91.3% / 92.8% / 94.3%
  MMMLU:                     92.7% / 91.1% / —     / 92.6–93.6%
  USAMO:                     97.6% / 42.3% / 95.2% / 74.4%
  GraphWalks BFS 256K–1M:    80.0% / 38.7% / 21.4% / —

  HLE (no tools):            56.8% / 40.0% / 39.8% / 44.4%
  HLE (with tools):          64.7% / 53.1% / 52.1% / 51.4%

  CharXiv (no tools):        86.1% / 61.5% / —     / —
  CharXiv (with tools):      93.2% / 78.9% / —     / —

  OSWorld:                   79.6% / 72.7% / 75.0% / —

160 comments


sourcecodeplz 1 day ago

Haven't seen a jump this large since I don't even know, years? Too bad they are not releasing it anytime soon (there is no need as they are still currently the leader).

ru552 1 day ago
There's speculation that next Tuesday will be a big day for OpenAI and possibly GPT 6. Anthropic showed their hand today.
- varispeed 1 day ago
  
  Sounds like a good opportunity to pause spending on nerfed 4.6 and wait for the new model to be released and then max out over 2 weeks before it gets nerfed again.
  
- enraged_camel 1 day ago
  
  That does not sound very believable. Last time Anthropic released a flagship model, it was followed by GPT Codex literally that afternoon.
  
- swalsh 1 day ago
  
  My understanding is GPT 6 works via synaptic space reasoning... which I find terrifying. I hope if true, OpenAI does some safety testing on that, beyond what they normally do.
  
lumost 1 day ago

Is this even real? Coming off the heels of GLM5.1's announcement, this feels almost like a llama 4 launch to head off competition.
m3kw9 7 hours ago
not much of a jump 94.5% / 91.3%
- kkoncevicius 4 hours ago
  
  We can look at the same numbers in a different way:
  Error with 91.3% = 8.7%
  Error with 94.5% = 5.5%
  Error reduction = 8.7% - 5.5% = 3.2 percentage points
  So the improvement is 3.2 / 8.7 = 36.8%
- enraged_camel 7 hours ago
  
  Actually, going from 91.3% to 94.5% is a significant jump, because it means the model has gotten a lot better at solving the hardest problems thrown at it. This has downstream effects as well: it means that during long implementation tasks, instead of getting stuck at the most challenging parts and stopping (or going in loops!), it can now get past them to finish the implementation.
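The error-rate arithmetic in kkoncevicius's reply can be sketched in a few lines. This is only an illustration of that calculation; the function name `relative_error_reduction` is made up here, and it assumes both scores are accuracies in percent on the same benchmark.

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the old error rate that disappears when accuracy rises.

    Both arguments are accuracies in percent (e.g. 91.3 and 94.5).
    """
    old_err = 100.0 - old_acc  # 8.7 for Opus 4.6 on GPQA Diamond
    new_err = 100.0 - new_acc  # 5.5 for Mythos on GPQA Diamond
    return (old_err - new_err) / old_err


reduction = relative_error_reduction(91.3, 94.5)
print(f"{reduction:.1%} of the remaining errors eliminated")
```

Read this way, a 3.2-point gain near the top of the scale removes over a third of the remaining failures, which is the point both replies are making.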
Jcampuzano2 1 day ago
A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.
I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.
They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.
- cedws 1 day ago
  
  More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market. These companies can also choose to give exclusive access to hand picked individuals and cut everyone else off and there would be nothing to stop them.
  This is already happening to some degree, GPT 5.3 Codex's security capabilities were given exclusively to those who were approved for a "Trusted Access" programme.
  
- ben_w 14 hours ago
  
  > I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.
  I read it like I always read the GPT-2 announcement no matter what others say: It's *not* being called "too dangerous to ever release", but rather "we need to be mindful, knowing perfectly well that other AI companies can replicate this imminently".
  The important corps (so presumably including the Linux Foundation, bigger banks and power stations, and quite possibly excluding x.com) will get access now, and some other LLM which is just as capable will give it to everyone in 3 months time at which point there's no benefit to Anthropic keeping it off-limits.
- marcus_holmes 21 hours ago
  
  This is my nightmare about AI; not that the machines will kill all the humans, but that access is preferentially granted to the powerful and it's used to maintain the current power structure in blatant disregard of our democratic and meritocratic ideals, probably using "security" as the justification (as usual).
- alwillis 14 hours ago
  
  > They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.
  That’s not going to happen. If you recall, OpenAI didn’t release a model a few years ago because they felt it was too dangerous.
  Anthropic is giving the industry a heads up and time to patch their software.
  They said there are exploitable vulnerabilities in every major operating system.
  But in 6 months every frontier model will be able to do the same things. So Anthropic doesn’t have the luxury of not shipping their best models. But they also have to be responsible as well.
- quotemstr 1 day ago
  
  This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.
  
- mike_hearn 8 hours ago
  
  I think they already said somewhere that they can't release Mythos because it requires absurdly large amounts of compute. The economics of releasing it just don't work.
  
- guzfip 1 day ago
  
  > A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.
  > They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped
  Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.

WarmWash 1 day ago

Are these fair comparisons? It seems like Mythos is going to be like a 5.4 ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.

mulmboy 1 day ago
There are a few hints in the doc around this
> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)
^^ From the surrounding context, this could just be because the model tends to do a lot of work in the background which naturally takes time.
> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)
> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)
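The figures in the last quote can be checked directly. A minimal sketch, assuming "226k" and "1.11M" mean 226,000 and 1,110,000 tokens per task; the variable names are illustrative only.

```python
# Token counts per task as stated in the quoted passage (p191).
opus_tokens = 1_110_000    # "1.11M tokens per task"
mythos_tokens = 226_000    # "226k"

# How many times more tokens Opus 4.6 used for its best result.
token_ratio = opus_tokens / mythos_tokens

# Accuracy difference in percentage points (86.9% vs. 83.7%).
accuracy_gain = round(86.9 - 83.7, 1)

print(f"token ratio: {token_ratio:.1f}x, accuracy gain: {accuracy_gain} points")
```

The ratio comes out at roughly 4.9, matching the quoted "4.9× fewer tokens", alongside a 3.2-point accuracy gain.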
- alyxya 1 day ago
  
  The first point is along the lines of what I'd expect given that claude code is generally reliable at this point. A model's raw intelligence doesn't seem as important right now compared to being able to support arbitrary length context.
- derangedHorse 17 hours ago
  
  The quote comparing them here was for BrowseComp which "tests an agent's ability to find hard-to-locate information on the open web." (for those wondering). The new model seems significantly better than Opus4.6 judging by the 'Overall results summary'
- zozbot234 1 day ago
  
  Good catch. If it's "too slow" even when run in a state-of-the-art datacenter environment, this "Mythos" model is most closely comparable to the "Deep Research" modes for GPT and Gemini, which Claude formerly lacked any direct equivalent for.
  
- naasking 6 hours ago
  
  I'm curious if frontier labs use any forms of compression on their models to improve performance. The small % drop of Q8 or FP8 would still put it ahead of Opus, but should double token throughput. Maybe then interactive use would feel like an improvement.

WinstonSmith84 1 day ago

Not discussing Mythos here, but Opus. Opus to me has been significantly better at SWE than GPT or Gemini, so I'm confused why Opus is ranking clearly lower than GPT, and even lower than Gemini.

muyuu 21 hours ago
When did you last compare them? Codex right now is considerably better in my experience. Can't speak for Gemini.
- gck1 20 hours ago
  
  Tried Gemini 2 weeks ago to see where it's at, with gemini-cli.
  Failed to use tools, failed to follow instructions, and then went into deranged loop mode.
  Essentially, it's where it was 1.5 years ago when I tried it the last time.
  It's honestly unbelievable how Google managed to fail so miserably at this.
  
- sandos 14 hours ago
  
  Agree, I never actually had great success with Opus. I think it's the failures that are annoying; it's probably better than codex when it's "good", but it fails in annoying ways that I think codex very seldom does.
- StingyJelly 11 hours ago
  
  I wouldn't call codex considerably better. It may depend on the specific codebase and your expectations, but codex produces more "abstraction for the sake of abstraction" even on simple tasks, while opus in my experience usually chooses the right level of abstraction for the given task.
otabdeveloper4 11 hours ago

A secret art known to the cognoscenti as "benchmark gaming".

pants2 1 day ago

We're gonna need some new benchmarks...

ARC-AGI-3 might be the only remaining benchmark below 50%

Leynos 1 day ago

Opus 4.6 currently leads the remote labor index at 4.17. GPT-5.4 isn't measured on that one though: https://www.remotelabor.ai/
GPT 5.4 Pro leads FrontierMath Tier 4 at 35%: https://epoch.ai/benchmarks/frontiermath-tier-4/
randomtoast 1 day ago
Humanity's Last Exam (HLE) is already insanely difficult. It introduces 2,500 questions spanning mathematics, humanities, natural sciences, ancient languages, ...
Here is an example question: https://i.redd.it/5jl000p9csee1.jpeg
No human could even score 5% on HLE.
- saberience 12 hours ago
  
  I've never understood the point of things like HLE, it doesn't really prove or show anything since 99.99% of humans can't do a single question on this exam.
  That is, it's easy to make benchmarks which humans are bad at, humans are really bad at many things.
  Divide 123094382345234523452345111 by 0.1234243131324, guess what, humans would find that hard, computers easy. But it doesn't mean much.
  Humanity's last exam (HLE) couldn't be completed by most of humanity, the vast majority, so it doesn't really capture anything about humanity or mean much if a computer can do it.
  

AlexC04 1 day ago

but how does it perform on pelican riding a bicycle bench? why are they hiding the truth?!

(edit: I hope this is an obvious joke. less facetiously these are pretty jaw dropping numbers)

bertil 1 day ago

We are all fans of Simon’s work, and his test is, strangely enough, quite good.

ninjagoo 1 day ago

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%

> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> OSWorld: 79.6% / 72.7% / 75.0% / —

Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?

And the decision to withhold general release (of a 'preview' no less!) seems to be, well, odd. And the decision to release a 'preview' version to specific companies? You know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LoL.

What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?

TacticalCoder 1 day ago
> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen
We're not reading the same numbers I think. Compared to Opus 4.6, it's a big jump nearly in every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU but they're still beating their own Opus 4.6 results on these two.
This sounds like a much better model than Opus 4.6.
- ninjagoo 1 day ago
  
  > We're not reading the same numbers I think.
  We must not be.
  That's why I listed out the ones where it is barely competitive from @babelfish's table, which itself is extracted from Pg 186 & 187 of the System Card, which has the comparison with Opus 4.6, GPT 5.4 and Gemini 3.1 Pro.
  Sure, it may be better than Opus 4.6 on some of those, but barely achieves a small increase over GPT-5.4 on the ones I called out.
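One way to make this disagreement concrete is to compute, for the five benchmarks ninjagoo listed, Mythos's margin over Opus 4.6 and over the best-scoring other model. This is a rough sketch using only the numbers in babelfish's table; the names `SCORES` and `margins` are invented for the example, and taking the upper end of Gemini's MMMLU range is an assumption.

```python
# Scores from babelfish's table: (Mythos, Opus 4.6, GPT-5.4, Gemini 3.1 Pro).
# None marks a score the table leaves blank.
SCORES = {
    "Terminal-Bench 2.0": (82.0, 65.4, 75.1, 68.5),
    "GPQA Diamond": (94.5, 91.3, 92.8, 94.3),
    "MMMLU": (92.7, 91.1, None, 93.6),  # assumption: upper end of the 92.6-93.6 range
    "USAMO": (97.6, 42.3, 95.2, 74.4),
    "OSWorld": (79.6, 72.7, 75.0, None),
}


def margins(scores):
    """Map benchmark -> (lead over Opus 4.6, lead over best other model), in points."""
    out = {}
    for name, (mythos, opus, *others) in scores.items():
        rivals = [s for s in (opus, *others) if s is not None]
        out[name] = (round(mythos - opus, 1), round(mythos - max(rivals), 1))
    return out


MARGINS = margins(SCORES)
for name, (vs_opus, vs_best) in MARGINS.items():
    print(f"{name:20s} vs Opus 4.6: {vs_opus:+5.1f}   vs best other: {vs_best:+5.1f}")
```

Both readings show up in the output: the margin over Opus 4.6 is positive on every row, while the margin over the best competitor is small on GPQA Diamond and negative on MMMLU under the upper-bound assumption.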
  
enraged_camel 1 day ago

Let's be clear: your entire post is just pure, unadulterated FUD. You first claim, based on cherry-picked benchmarks, that Mythos is actually only "barely competitive" with existing models, then suggest they must be training to the test, then call it "odd" that they are withholding the release despite detailed and forthcoming explanations from Anthropic regarding why they are doing that, then wrap it up with the completely unsubstantiated claim that they must be bleeding subscribers and that this must just be to stop that bleed.

matheusmoreira 1 day ago

Wow. Mythos must be insanely good considering how good a model Opus already is. I hope it's usable on a humble subscription...

crimsoneer 14 hours ago
You get a single call a month. Use it wisely.
- FridgeSeal 10 hours ago
  
  What is the meaning of life, the universe, and everything?
  > Thought for 7.5 million years
  

cesarvarela 17 hours ago

I thought they were bluffing when they talked about the scaling laws, but looking at the benchmark scores, they were not.

I wonder if misalignment correlates with higher scores.

whalesalad 1 day ago

Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.

babelfish 1 day ago
Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)
- girvo 1 day ago
  
  It really isn’t. I wish it was, because work complains about overuse of Opus.
  
rafaelmn 1 day ago
GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase - ignore all the conventions, surrounding context, just slop all over the place to get it working. Claude is just a level above in terms of editing code.
- sho_hn 1 day ago
  
  Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.
  Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.
  It's annoying, too, because I don't much like OpenAI as a company.
  (Background: 25 years of C++ etc.)
  
- Jcampuzano2 1 day ago
  
  Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.
  But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.
- camdenreslink 1 day ago
  
  ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).
- leobuskin 1 day ago
  
  And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), even when 5.4 gives a little bit better guesses (rarely, but happens) - it takes x2-x4 longer. So, it’s just easier to reiterate with Opus
  
- zarzavat 1 day ago
  
  Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.
  
- whalesalad 1 day ago
  
  This has been my experience. With very very rigid constraints it does ok, but without them it will optimize expediency and getting it done at the expense of integrating with the broader system.
  

johnnichev 1 day ago

damn... ok that's impressive.

simianwords 1 day ago

The real part is SWE-bench Verified since there is no way to overfit. That's the only one we can believe.

ollin 1 day ago
My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.
OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:
https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions
> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix
> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time
> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.
- simianwords 1 hour ago
  
  > My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.
  Anthropic accounts for this
  >To detect memorization, we use a Claude-based auditor that compares each model-generated patch against the gold patch and assigns a [0, 1] memorization probability. The auditor weighs concrete signals—verbatim code reproduction when alternative approaches exist, distinctive comment text matching ground truth, and more—and is instructed to discount overlap that any competent solver would produce given the problem constraints.
- simianwords 1 day ago
  
  I stand corrected.

maplethorpe 13 hours ago

Funny, I made my own model at home and got even higher scores than these. I'm a bit concerned about releasing it, though, so I'm just going to keep it local for now.