Comment by slopinthebag

9 hours ago

Idk, it seems reasonable to me

> "Our tests gave models the vulnerable function directly, often with contextual hints. A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do."

Also, they included a test with a false positive; the small models got it right and Opus got it wrong. So this paper shows that, with the right approach and harness, these smaller models can produce the same results. That's awesome!

So, if you're struggling to make these smaller models work, it's almost certainly an issue of holding them wrong. They require a different approach/harness since they are less capable of working from a vague prompt and have a smaller context, but they are incredibly powerful when wielded by someone who knows how to use them. And since they are so fast and cheap, you can use them in ways that are not feasible with the larger, slower, more expensive models. But you have to know how to use them; it requires skill, unlike lazily prompting Claude Code, though the results can be far better. If you aren't integrating them into your workflow you're ngmi imo :) This will be the next big trend, especially as they continue to improve relative to SOTA, which is running into compute limitations.

Anthropic gave the model the whole codebase and told it to find a vulnerability in a specific file, iterating across sessions, each focusing on a different file.

What happens then, for example, is that the model looks through that particular file, identifies potential problems, and works upward through the codebase to check whether those could actually be hit.

“Hmm, here we assume the input has been validated; is there any way that might not be the case?”
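
To make the shape of that loop concrete, here's a minimal sketch of what such a per-file harness might look like. Everything here is assumed for illustration: `llm_complete` is a hypothetical stand-in for whatever model client you'd plug in, and the prompt wording, repo path, and file glob are made up.

```python
import pathlib

# Hypothetical stand-in for whatever model client the harness uses.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in your model API client here")

REPO = pathlib.Path("target-repo")  # assumed checkout location

PROMPT_TEMPLATE = """You are auditing {path} for security vulnerabilities.
The full repository is checked out alongside this file. For every potential
issue, trace the suspicious value upward through its callers to confirm the
dangerous path is actually reachable. Reply "NO ISSUES" if nothing holds up.

{source}
"""

findings = []
# One fresh session per file: each file is treated as the entry point,
# mirroring the per-file iteration described above.
for path in sorted(REPO.rglob("*.c")):
    prompt = PROMPT_TEMPLATE.format(path=path, source=path.read_text())
    report = llm_complete(prompt)
    if "NO ISSUES" not in report:
        findings.append((path, report))
```

The targeting work lives in the harness, not the model: each session gets one scoped entry point, and the "work upward through the codebase" step happens inside the session.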

This is not unique to Mythos. You can already do this with publicly available models. Mythos does appear to be significantly more capable, so it would get better results.

The research discussed here provided models with just a known buggy function, missing the whole process required to find that bug in the first place.

  • Mmm, Anthropic had a harness that had Mythos check each file as an entry point. That's not quite "here is a codebase, find vulns". A more sophisticated harness with a fast and cheap model could go function-by-function to do the same thing, which is what this was validating.

    > The research discussed here provided models with just a known buggy function, missing the whole process required to find that bug in the first place.

    That process can be made part of a harness, which again is what they were validating. A rough function-by-function sketch is below.

    I'm not sure why people are so hell-bent on disparaging open-source models here. I get that some people can't get results from them, but that's just a skill issue. We should all be ecstatic that we don't need to rely on the unethical AI corps to allow us to do our jobs.
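
    Here's the promised sketch of the function-by-function version, under assumed names: `llm_complete` is a hypothetical small-model client, the repo path is invented, and it slices a Python codebase with the stdlib `ast` module just to show the scoping idea.

    ```python
    import ast
    import pathlib

    def llm_complete(prompt: str) -> str:
        # Hypothetical small-model client; any local inference API works here.
        raise NotImplementedError("plug in your model client here")

    def iter_functions(repo: pathlib.Path):
        """Slice a Python codebase into per-function snippets so a
        small-context model only ever sees a scoped chunk, never the repo."""
        for path in repo.rglob("*.py"):
            source = path.read_text()
            for node in ast.walk(ast.parse(source)):
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    yield path, node.name, ast.get_source_segment(source, node)

    for path, name, snippet in iter_functions(pathlib.Path("target-repo")):
        verdict = llm_complete(
            f"Audit this function from {path} for vulnerabilities. State any "
            f"assumptions about its inputs that must be verified:\n\n{snippet}"
        )
    ```

    Because each call is tiny, a cheap model can churn through thousands of functions; a second pass could then pull in callers to check whether the flagged assumptions actually hold, the same way the paper's scoped prompts did.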