Comment by consumer451
6 hours ago
Somewhat related: someone posted a theory on reddit that Claude Code's new /ultrareview actually uses Mythos.
Does that seem plausible to anyone else? It runs on their cloud. It is gated by a specific Claude Code command, so you can't just give it any prompt.
One point in favor of this: it runs in their cloud, and it literally tells you up front that a run costs, I think, $10 to $25.
Why would they use their most expensive model when sonnet or opus can do the job as well?
In my experience Sonnet < Opus by a long shot for code review. Sonnet often flags things as errors that are not, because it fails to grasp the big picture… and it also misses structural issues that are perfectly coded line by line and only show up as problems at the meta scale.
I have no reason to believe the next generation won’t offer similar gains in verification, and there is some evidence that the cybersecurity implications are the result of exactly this expansion of ability.
It depends on how you review. In an orchestrated per-task review workflow with clearly defined acceptance criteria and implementation requirements, using anything other than Sonnet (handed those criteria and requirements) hasn’t really led to much improvement, but it drives up usage and takes longer. I even tried Haiku, but, yeah, Haiku is just not viable for review, even tightly scoped, lol.
Siccing Sonnet on a codebase or PR without guidance does indeed lead to worse results than using Opus, though.
It would be pretty simple to see what API they're calling.
That's what I meant to get at by "it runs on their cloud."
They can name that user-facing ultrareview API endpoint whatever they want, and we have no way to see what model endpoint it calls internally once running on their cloud, right?
Introduce intentional, increasingly subtle vulns and test against Sonnet, Opus, etc.? That should give statistical evidence of its relative power.
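The experiment above can be sketched as a tiny harness. This is a minimal illustration, not a real benchmark: `call_review_model` is a hypothetical stand-in for whatever API call you'd make to Sonnet, Opus, or the rumored model, stubbed here with a trivial keyword check so the sketch runs self-contained. The snippets and the "detection rate" metric are assumptions for illustration only.

```python
# Hypothetical harness for the proposed experiment: inject known vulns
# into snippets, ask each model to review them, and compare detection rates.

# (snippet, contains_injected_vuln) pairs -- illustrative examples only.
VULN_SNIPPETS = [
    # String-concatenated SQL: classic injection vulnerability.
    ('query = "SELECT * FROM users WHERE id = " + user_id', True),
    # Parameterized query: the safe equivalent, should not be flagged.
    ('cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))', False),
]

def call_review_model(model: str, snippet: str) -> bool:
    """Stub for a real model API call. Returns True if the 'model'
    flags the snippet as vulnerable. Here: a naive check for
    string-concatenated SQL, purely so the sketch executes."""
    return '" + ' in snippet

def detection_rate(model: str) -> float:
    """Fraction of snippets where the model's verdict matches ground truth."""
    correct = [
        call_review_model(model, snippet) == is_vuln
        for snippet, is_vuln in VULN_SNIPPETS
    ]
    return sum(correct) / len(correct)
```

Run the same snippet set through each model and compare the rates; with enough snippets at graded subtlety levels, the gap between models (if any) becomes statistically measurable rather than anecdotal.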