Comment by epistasis
13 hours ago
> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.
Impressive, and very valuable work, but isolating the relevant code changes the situation so much that I'm not sure it's much of the same use case.
Being able to dump an entire code base and have the model scan it is the type of situation that opens up vulnerability scanning to a much larger class of people.
This is from the first of the caveats that they list:
> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.
That's why their point is what the subheadline says, that the moat is the system, not the model.
Everybody so far here seems to be misunderstanding the point they are making.
If that's the point they are making, let's see the false positive rate their system produces on the entire codebase.
They measured false negatives on a handful of cases, but that is not enough to hint at the system you suggest. And based on my experiences with $$$ focused eval products that you can buy right now, e.g. greptile, the false positive rate will be so high that it won't be useful to do full codebase scans this way.
I get what you're saying, but I think this is still missing something pretty critical.
The smaller models can recognize the bug when they're looking right at it, that seems to be verified. And with AISLE's approach you can iteratively feed the models one segment at a time cheaply. But if a bug spans multiple segments, the small model doesn't have the breadth of context to understand those segments in composite.
The advantage of the larger model is that it can retain more context and potentially find bugs that require more code context than one segment at a time.
That said, the bugs showcased in the Mythos paper all seemed to be shallow bugs that start and end in a single input segment, which is why AISLE was able to find them. But having more context in the window theoretically puts less shallow bugs within range for the model.
I think the point they are making, that the model doesn't matter as much as the harness, stands for shallow bugs but not for vulnerability discovery in general.
OK, consider a for loop that goes through your repo, then goes through each file, and then goes through each common vulnerability...
Is Mythos somehow more powerful than just a recursive for loop, aka "agentic" review? You can run `open code run --command` with a tailored command for whatever vulnerabilities you're looking for.
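The loop described above can be sketched roughly as follows. This is only an illustration, not anyone's actual scanner: `VULN_CLASSES`, `build_jobs`, and the prompt wording are hypothetical, and the step that sends each prompt to a model is deliberately left out.

```python
from itertools import product
from pathlib import Path

# Hypothetical checklist; a real one would be much longer.
VULN_CLASSES = ["integer overflow", "use-after-free", "out-of-bounds write"]


def build_jobs(repo_root, suffixes=(".c", ".h")):
    """Yield one (path, vuln_class, prompt) job per source file per vulnerability class."""
    files = sorted(
        p for p in Path(repo_root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )
    for path, vuln in product(files, VULN_CLASSES):
        source = path.read_text(errors="replace")
        prompt = (
            f"Audit {path.name} for {vuln}. "
            f"Reply NONE if nothing is found.\n\n{source}"
        )
        yield path, vuln, prompt
```

The number of jobs is files times vulnerability classes, so the cost grows linearly with both lists. That is what makes the approach plausible with a cheap model.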
huh, running it over each function in theory but testing just the specific ones here makes sense, but that hint?!
I agree.
To clarify, I don't necessarily agree with the post or their approach. I just thought folks were misreading it. I also think it adds something useful to the conversation.
> That's why their point is what the subheadline says, that the moat is the system, not the model.
I'm skeptical; they provided a tiny piece of code and a hint to the possible problem, and their system found the bug using a small model.
That is hardly useful, is it? In order to get the same result, they had to know both where the bug is and what the bug is.
All these companies in the business of "reselling tokens, but with a markup" aren't going to last long. The only strategy is "get bought out and cash out before the bubble pops".
> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior").
To be fair, nothing stops anyone from feeding each function of a given codebase separately with one out of the predefined set of hints.
It's just AST and a for loop. Calling it a system is a bit much.
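For what it's worth, the "AST and a for loop" version might look something like this. It is a sketch under loud assumptions: it uses Python's built-in `ast` module, so it only splits Python source (a C codebase like FreeBSD would need a C parser instead), and `HINTS` and `scoped_prompts` are made-up names. Only the first hint comes from the quoted caveat.

```python
import ast

# First hint is quoted from the article's caveat; the second is a hypothetical addition.
HINTS = ["consider wraparound behavior", "consider unchecked array indexes"]


def scoped_prompts(source, hints=HINTS):
    """Return (function_name, hint, prompt) for every function/hint pair in `source`."""
    tree = ast.parse(source)
    prompts = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Recover the exact text of this function from the original source.
            body = ast.get_source_segment(source, node)
            for hint in hints:
                prompts.append((node.name, hint, f"Hint: {hint}\n\n{body}"))
    return prompts
```

Note that every function gets every hint, so this reproduces the "scoped context" setup without knowing in advance where the bug is. Whether the resulting false positive rate is tolerable is a separate question.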
> That's why their point is what the subheadline says, that the moat is the system, not the model.
Can you expand a bit more on this? What is the system then in this case? And how was that model created? By AI? By humans?
You can imagine a pipeline that looks at individual source files or functions. And first "extracts" what is going on. You ask the model:
- "Is the code doing arithmetic in this file/function?"
- "Is the code allocating and freeing memory in this file/function?"
- "Is the code doing X/Y/Z?" etc.
For each question, you design the follow-up vulnerability searchers.
For a function you see doing arithmetic, you ask:
- "Does this code look like integer overflow could take place?"
For memory:
- "Do all the pointers end up being freed?" _or_
- "Do all pointers only get freed once?"
I think that's the harness part in terms of generating the "bug reports". From there on, you'll need a bunch of tools for the model to interact with the code. I'd imagine you'll want to build a harness/template for the file/code/function to be loaded into, and executed under ASAN.
If you have an agent that thinks it found a bug: "Yes file xyz looks like it could have integer overflow in function abc at line 123, because...", you force another agent to load it in the harness under ASAN and call it. If ASAN reports a bug, great, you can move the bug to the next stage, some sort of taint analysis or reachability analysis.
So at this point you're running a pipeline to:
1) Extract "what this code does" at the file, function or even line level.
2) Put code you suspect of being vulnerable in a harness to verify agent output.
3) Put code you confirmed is vulnerable into a queue to perform taint analysis on, to see if it can be reached by attackers.
Traditionally, I guess a fuzzer approached this from 3 -> 2, and there was no "stage 1". Because LLMs "understand" code, you can invert this system and work it up from "understanding", i.e. approach it from the other side. You ask: given this code, is there a bug, and if so, can we reach it? Instead of asking: given this public interface and a bunch of data we can stuff in it, does something happen we consider exploitable?
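As a rough sketch of that three-stage flow, with the caveat that everything here is hypothetical: `classify`, `ask`, and `crashes_under_asan` are placeholders for whatever model calls and ASAN harness you would actually wire in, and `FOLLOW_UPS` only reuses the example questions from this comment.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Finding:
    function: str
    question: str


# Stage-1 tag -> follow-up questions, using the examples given above.
FOLLOW_UPS = {
    "arithmetic": ["Does this code look like integer overflow could take place?"],
    "memory": [
        "Do all the pointers end up being freed?",
        "Do all pointers only get freed once?",
    ],
}


def run_pipeline(functions, classify, ask, crashes_under_asan):
    """functions: {name: code}. Returns the queue of findings awaiting taint analysis."""
    taint_queue = deque()
    for name, code in functions.items():
        for tag in classify(code):                     # stage 1: what does this code do?
            for question in FOLLOW_UPS.get(tag, []):
                if not ask(code, question):            # model sees no bug of this kind
                    continue
                finding = Finding(name, question)
                if crashes_under_asan(finding):        # stage 2: verify the claim
                    taint_queue.append(finding)        # stage 3 input: is it reachable?
    return taint_queue
```

The point of passing the three callables in is that the loop itself stays trivial. All the difficulty lives in building a harness that can actually execute an arbitrary function under ASAN.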
If that’s the case, why didn’t they do it that way?
Tunnel vision? If your model can handle big context, why divide into lesser problems to conquer - even if such splitting might be quite trivial and obvious?
It's the difference between "achieve the goal" and "achieve the goal in this one particular way" (leverage large context).
> That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.
Unless the context they added to get the small model to find it was generated fully by their own scaffold (which I assume it was not, since they'd have bragged about it if it was), either they're admitting theirs isn't well designed, or they're outright lying.
People aren't missing the point, they're saying the point is dishonest.
> Anthropic's own scaffold is described in their technical post: launch a container, prompt the model to scan files, let it hypothesize and test, use ASan as a crash oracle, rank files by attack surface, run validation. That is very close to the kind of system we and others in the field have built, and we've demonstrated it with multiple model families, achieving our best results with models that are not Anthropic's. The value lies in the targeting, the iterative deepening, the validation, the triage, the maintainer trust. The public evidence so far does not suggest that these workflows must be coupled to one specific frontier model.
The argument in the article is that the framework to run and analyze the software being tested is doing most of the work in Anthropic's experiment, and that you can get similar results from other models when used in the same way.
Maybe that's true, but they didn't actually show that that's true, since they didn't try scaffolding smaller models in a similar way at all.
The thing is with smaller cheaper models it is very possible to simply take every file in a codebase, and prompt it asking for it to find vulnerabilities.
You could even isolate it down to every function and create a harness that provides it a chain of where and how the function is used and repeat this for every single function in a codebase.
For some very large codebases this would be unreasonable, but many of the companies making these larger models do realistically have the compute available to run a model on every single function in most codebases.
You have the harness run this many times per file/function, then find the ones that are consistently (or on average) pointed to as possible vulnerability vectors, pass those on to a larger model to inspect deeper, and repeat.
Most of the work here wouldn't be the model, it'd be the harness which is part of what the article alludes to.
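A minimal sketch of that "run it many times and escalate the consistent hits" step might look like the following. The names `shortlist` and `flag`, and the `runs` and `threshold` defaults, are all assumptions for illustration; `flag(unit, run_index)` stands in for one cheap-model call that returns True when the model reports a possible vulnerability.

```python
from collections import Counter


def shortlist(units, flag, runs=5, threshold=0.6):
    """Return (unit, flag_rate) pairs flagged in at least `threshold` of runs, highest first."""
    hits = Counter()
    for unit in units:
        for run_index in range(runs):
            if flag(unit, run_index):   # one cheap-model pass over this file/function
                hits[unit] += 1
    ranked = [
        (unit, hits[unit] / runs)
        for unit in units
        if hits[unit] / runs >= threshold
    ]
    # Most consistently flagged units go to the larger model first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Everything returned by `shortlist` would then be handed to the larger model for a deeper look. The threshold is the knob that trades missed bugs against wasted expensive-model calls.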
> it is very possible to simply take every file in a codebase, and prompt it asking for it to find vulnerabilities.
My understanding (based on the Security, Cryptography, Whatever podcast interview[0] -- which, by the way, go listen to it) is that this is actually what Anthropic did with the large model for these findings.
[0]: https://securitycryptographywhatever.com/2026/03/25/ai-bug-f...
> I wrote a single prompt, which was the same for all of the content management systems, which is, I would like you to audit the security of this codebase. This is a CMS. You have complete access to this Docker container. It is running. Please find a bug. And then I might give a hint. “Please look at this file.” And I’ll give different files each time I invoke it in order to inject some randomness, right? Because the model is gonna do roughly the same time each time you run it. And so if I want to have it be really thorough, instead of just running 100 times on the same project, I’ll run it 100 times, but each time say, “Oh, look at this login file, look at this other thing.” And just enumerate every file in the project basically.
Isn't the difference just harness then? I can write a harness that chunks code into individual functions or groups of functions and then feed it into a vulnerability analysis agent.
It's probably not the 'only' difference, because clearly the models are advancing in capability, but it's likely way more important than generally given credit for.