
Comment by elicash

11 hours ago

This is from the first of the caveats that they list:

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.

That's why their point is what the subheadline says, that the moat is the system, not the model.

Everybody so far here seems to be misunderstanding the point they are making.

If that's the point they are making, let's see the false positive rate it produces on the entire codebase.

They measured false negatives on a handful of cases, but that is not enough to hint at the system you suggest. And based on my experience with $$$ focused eval products that you can buy right now, e.g. Greptile, the false positive rate will be so high that it won't be useful to do full codebase scans this way.

I get what you're saying, but I think this is still missing something pretty critical.

The smaller models can recognize the bug when they're looking right at it, that seems to be verified. And with AISLE's approach you can iteratively feed the models one segment at a time cheaply. But if a bug spans multiple segments, the small model doesn't have the breadth of context to understand those segments in composite.

The advantage of the larger model is that it can retain more context and potentially find bugs that require more code context than one segment at a time.

That said, the bugs showcased in the Mythos paper all seemed to be shallow bugs that start and end in a single input segment, which is why AISLE was able to find them. But having more context in the window theoretically puts less-shallow bugs within the model's range.

I think the point they are making, that the model doesn't matter as much as the harness, stands for shallow bugs but not for vulnerability discovery in general.

  • OK, consider a for loop that goes through your repo, then goes through each file, and then goes through each common vulnerability...

    Is Mythos somehow more powerful than just a recursive for loop, aka "agentic" review? You can run `open code run --command` with a tailored command for whatever vulnerabilities you're looking for.
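
    As a sketch of that loop (assuming a hypothetical `agent-cli` command and made-up vulnerability classes; substitute whatever agent CLI and checks you actually use):

```python
# Hypothetical sketch of the "repo x files x vulns" for-loop review.
# "agent-cli" and its flags are placeholders, not a real tool's interface.
import subprocess
from pathlib import Path

VULN_CLASSES = [
    "integer overflow / wraparound",
    "use-after-free / double free",
    "out-of-bounds read or write",
]

def scan_repo(repo: Path) -> list[tuple[Path, str, str]]:
    """Ask an agent CLI about each (file, vulnerability class) pair."""
    findings = []
    for src in sorted(repo.rglob("*.c")):
        for vuln in VULN_CLASSES:
            prompt = f"Review {src} for {vuln}. Report file:line if found."
            result = subprocess.run(
                ["agent-cli", "run", "--command", prompt],  # placeholder CLI
                capture_output=True, text=True,
            )
            if result.returncode == 0 and result.stdout.strip():
                findings.append((src, vuln, result.stdout.strip()))
    return findings
```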

    • Newer models have larger context windows, and more stable reasoning across those larger windows.

      If you point your model directly at the thing you want it to assess, and it doesn't have to gather any additional context, you're not really testing those things at all.

      Say you point Kimi and Opus at some code and give them an agentic looping harness with code-review tools. They're going to start digging into the code, gathering context by mapping out references and following leads.

      If the bug is really shallow, the model is going to get everything it needs to find it right away, neither of them will have any advantage.

      If the bug is deeper and requires a lot more code context, Opus is going to be able to hold onto a lot more information, and it's going to be a lot better at reasoning across all of that information. That's a test that would actually compare the models directly.

      Mythos is just a bigger model with a larger context window and, presumably, better prioritization and stronger attention mechanisms.


Huh. Running it over each function in theory but testing just the specific ones here makes sense, but that hint?!

  • I agree.

    To clarify, I don't necessarily agree with the post or their approach. I just thought folks were misreading it. I also think it adds something useful to the conversation.

> That's why their point is what the subheadline says, that the moat is the system, not the model.

I'm skeptical; they provided a tiny piece of code and a hint to the possible problem, and their system found the bug using a small model.

That is hardly useful, is it? In order to get the same result, they had to know both where the bug is and what the bug is.

All these companies in the business of "reselling tokens, but with a markup" aren't going to last long. The only strategy is "get bought out and cash out before the bubble pops".

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior").

To be fair, nothing stops anyone from feeding each function of a given codebase separately, with one hint out of a predefined set.

It's just AST and a for loop. Calling it a system is a bit much.
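
That loop is easy to sketch. Using Python's `ast` module on Python sources for illustration (a C codebase would need a real parser such as libclang instead), with a made-up hint set:

```python
# Sketch of "AST and a for loop": split a codebase into functions and pair
# each one with every hint from a predefined set. The hints are made up.
import ast

HINTS = [
    "consider wraparound behavior",
    "consider off-by-one errors at buffer boundaries",
    "consider unchecked return values",
]

def scoped_prompts(source: str):
    """Yield one (function_source, hint) pair per function per hint."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            func_src = ast.get_source_segment(source, node)
            for hint in HINTS:
                yield func_src, hint

code = "def add(a, b):\n    return a + b\n"
prompts = list(scoped_prompts(code))  # one function x three hints
```

Each (function, hint) pair then becomes one scoped model call, which is exactly the shape of context the quoted caveat describes.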

> That's why their point is what the subheadline says, that the moat is the system, not the model.

Can you expand a bit more on this? What is the system then in this case? And how was that model created? By AI? By humans?

  • You can imagine a pipeline that looks at individual source files or functions. And first "extracts" what is going on. You ask the model:

    - "Is the code doing arithmetic in this file/function?"

    - "Is the code allocating and freeing memory in this file/function?"

    - "Is the code doing X/Y/Z?" etc.

    For each question, you design the follow-up vulnerability searchers.

    For a function you see doing arithmetic, you ask:

    - "Does this code look like integer overflow could take place?"

    For memory:

    - "Do all the pointers end up being freed?" _or_

    - "Do all pointers only get freed once?"

    I think that's the harness part in terms of generating the "bug reports". From there on, you'll need a bunch of tools for the model to interact with the code. I'd imagine you'll want to build a harness/template for the file/code/function to be loaded into, and executed under ASAN.

    If you have an agent that thinks it found a bug ("Yes, file xyz looks like it could have integer overflow in function abc at line 123, because..."), you force another agent to load it in the harness under ASAN and call it. If ASAN reports a bug, great, you can move the bug to the next stage: some sort of taint analysis or reachability analysis.

    So at this point you're running a pipeline to:

    1) Extract "what this code does" at the file, function or even line level.

    2) Put code you suspect of being vulnerable in a harness to verify agent output.

    3) Put code you confirmed is vulnerable into a queue to perform taint analysis on, to see if it can be reached by attackers.

    Traditionally, I guess a fuzzer approached this from 3 -> 2, and there was no "stage 1". Because LLMs "understand" code, you can invert this system and work it up from "understanding", i.e. approach it from the other side. You ask "given this code, is there a bug, and if so, can we reach it?" instead of asking "given this public interface and a bunch of data we can stuff into it, does something happen that we consider exploitable?"
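
    Stubbing out the model and sanitizer calls, those stages might be wired together like this (`ask_model` and `run_under_asan` are placeholders for whatever agent and build harness you actually use):

```python
# Sketch of the 1 -> 2 -> 3 pipeline described above, with the LLM and
# sanitizer calls stubbed out as placeholders.
from dataclasses import dataclass

@dataclass
class Candidate:
    function: str
    claim: str             # e.g. "suspected integer overflow"
    confirmed: bool = False

def ask_model(question: str, code: str) -> bool:
    """Placeholder: ask an LLM a yes/no question about a code snippet."""
    raise NotImplementedError

def run_under_asan(function: str, claim: str) -> bool:
    """Placeholder: build the function into a harness, run it under ASAN."""
    raise NotImplementedError

def pipeline(functions, ask=ask_model, verify=run_under_asan):
    taint_queue = []
    for fn in functions:
        # Stage 1: extract "what does this code do?", then follow up
        # with the matching vulnerability question.
        if ask("Is this code doing arithmetic?", fn) and \
           ask("Does integer overflow look possible here?", fn):
            cand = Candidate(fn, "suspected integer overflow")
            # Stage 2: a second agent must reproduce the claim under ASAN.
            if verify(cand.function, cand.claim):
                cand.confirmed = True
                # Stage 3: confirmed bugs queue up for taint/reachability
                # analysis to see whether attackers can reach them.
                taint_queue.append(cand)
    return taint_queue
```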

    • That's funny, this is how I've been doing security testing in my code for a while now, minus the 'taint analysis'. Who knew I was ahead of the game. :P

      In all seriousness though, it scares me that a lot of security-focused people seemingly haven't learned how LLMs work best for this stuff already.

      You should always be breaking your code down into testable chunks, with sets of directions about how to chunk them and what to do with those chunks. Anyone just vaguely gesturing at their entire repo going, "find the security vulns" is not a serious dev/tester; we wouldn't accept that approach in manual secure coding processes/SSDLCs.


    • I think there are already papers and presentations on integrating these kinds of iterative code understanding/verification loops into harnesses. There may be some advantages over fuzzing alone, but I think the cost-benefit analysis is a lot more mixed/complex than Anthropic would like people to believe. Sure, you need human engineers, but it's not insurmountably hard for a non-expert to figure out.

If that’s the case, why didn’t they do it that way?

  • Tunnel vision? If your model can handle big context, why divide into lesser problems to conquer, even if such splitting might be quite trivial and obvious?

    It's the difference between "achieve the goal" and "achieve the goal in this one particular way" (leverage large context).

    • I meant: if the claim here is that small models can accomplish the same things with good scaffolding, why didn't they demonstrate finding those problems with good scaffolding rather than directly pointing the models at them?


> That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.

Unless the context they added to get the small model to find it was generated fully by their own scaffold (which I assume it was not, since they'd have bragged about it if it was), they're either admitting their scaffold isn't well designed or they're outright lying.

People aren't missing the point, they're saying the point is dishonest.