Comment by johnfn

2 months ago

> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.
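For concreteness, here is a minimal sketch of the kind of per-file loop described in the quote above. It is not the actual Mythos scaffolding: `scan_codebase`, the `ask_model` callable, the prompt wording, and the `.c`/`.h` filter are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable


def scan_codebase(
    root: Path,
    ask_model: Callable[[str], str],
    suffixes: Iterable[str] = (".c", ".h"),
) -> dict[str, str]:
    """Prompt the model once per source file and collect its raw answers.

    `ask_model` is a hypothetical stand-in for whatever LLM client you use;
    it takes a prompt string and returns the model's reply.
    """
    wanted = set(suffixes)
    findings: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        # Skip directories and anything that is not a source file we care about.
        if not path.is_file() or path.suffix not in wanted:
            continue
        source = path.read_text(errors="replace")
        prompt = (
            f"Find security vulnerabilities in this file ({path.name}):\n\n{source}"
        )
        findings[str(path.relative_to(root))] = ask_model(prompt)
    return findings
```

The loop itself is trivial, which is the point of the quote. Whether the collected answers are usable without a human triaging them depends entirely on the model behind `ask_model`.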

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.

82 comments


sweezyjeezy 2 months ago

> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none.

'Or none' is ruled out since it found the same vulnerability - I agree that there is a question on precision on the smaller model, but barring further analysis it just feels like '9500' is pure vibes from yourself? Also (out of interest) did Anthropic post their false-positive rate?

The smaller model is clearly the more automatable one IMO if it has comparable precision, since it's just so much cheaper - you could even run it multiple times for consensus.
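A rough sketch of what "run it multiple times for consensus" could look like, assuming a hypothetical `classify` callable that returns `True` when the model flags the code as vulnerable. The names, the number of runs, and the threshold are illustrative choices, not anything from the article or from Anthropic.

```python
from collections import Counter
from typing import Callable


def consensus_verdict(
    classify: Callable[[str], bool],
    code: str,
    runs: int = 5,
    threshold: float = 0.6,
) -> bool:
    """Flag `code` only if at least `threshold` of the runs call it vulnerable.

    `classify` is a hypothetical wrapper around one (non-deterministic)
    model call; it is invoked `runs` times on the same input.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    votes = Counter(classify(code) for _ in range(runs))
    return votes[True] / runs >= threshold
```

Voting like this only improves precision if the model's mistakes vary from run to run. If it makes the same mistake every time, extra runs just cost more.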

johnfn 2 months ago
Admittedly just vibes from me, having pointed small models at code and asked them questions, no extensive evaluation process or anything. For instance, I recall models thinking that every single use of `eval` in javascript is a security vulnerability, even something obviously benign like `eval("1 + 1")`. But then I'm only posting comments on HN, I'm not the one writing an authoritative thinkpiece saying Mythos actually isn't a big deal :-)
- jorvi 2 months ago
  
  My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies, nor a massive acceleration on quality or breadth (not quantity!) of development.
  Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one. If testing and development only have marginal cost now, why aren't they going all in on writing extremely performant, almost completely bug-free native applications everywhere?
  And this repeats itself across all big tech or AI hype companies. They all have these supposed earth-shattering gains in productivity but then... there hasn't been anything to show for that in years? Despite that whole subsector of tech plus big tech dropping trillions of dollars on it?
  And then there is also the really uncomfortable question for all tech CEOs and managers: LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code. And LLMs are supposedly godlike. Leadership is a fuzzy thing. At some point the chickens will come home to roost and tech companies with LLM CEOs / managers and human developers, or even completely LLM'd ones, will outperform human-led / managed companies. The capital class will jeer about that for a while, but the cost of tokens will continue to drop to near zero. At that point, they're out of leverage too.
  
- ummonk 2 months ago
  
  What's a situation where one needs to use `eval` in a benign way in JS? If something is precomputable (e.g. `eval("1 + 1")` can just be replaced by 2), then it should be precomputed. If it's not precomputable then it's dependent on input and thus hardly benign -- you'll need to carefully verify that the inputs are properly sanitized.
- argee 2 months ago
  
  With LLMs (and colleagues) it might be a legitimate problem since they would load that eval into context and maybe decide it’s an acceptable paradigm in your codebase.
- bloaf 2 months ago
  
  I remember a study from a while back that found something like "50% of 2nd graders think that french fries are made out of meat instead of potatoes. Methodology: we asked kids if french fries were meat or potatoes."
  Everyone was going around acting like this meant 50% of 2nd graders were stupid with terrible parents. (Or, conversely, that 50% of 2nd graders were geniuses for "knowing" it was potatoes at all)
  But I think that was the wrong conclusion.
  The right conclusion was that all the kids guessed and they had a 50% chance of getting it right.
  And I think there is probably an element of this going on with the small models vs big models dichotomy.
  
idopmstuff 2 months ago
> 'Or none' is ruled out since it found the same vulnerability
It's not, though. It wasn't asked to find vulnerabilities over 10,000 files - it was asked to find a vulnerability in the one particular place in which the researchers knew there was a vulnerability. That's not proof that it would have found the vulnerability if it had been given a much larger surface area to search.
- sweezyjeezy 2 months ago
  
  I don't think the LLM was asked to check 10,000 files given these models' context windows. I suspect they went file by file too.
  That's kind of the point - I think there are three scenarios here:
  a) this is just the first time an LLM has done such thorough minesweeping
  b) previous versions of Claude did not detect this bug (seems the least likely)
  c) Anthropic have done this several times, but the false positive rate was so high that they never checked it properly
  Between a) and c) I don't have high confidence either way, to be honest.
- direwolf20 2 months ago
  
  Mythos was also asked to find a vulnerability in one file, in turn for each file. Maybe the small model needs to be asked about each function instead of each file. Okay, you can still automate that.
jgalt212 2 months ago

or run multiple cheap models in parallel: MOE^n, in effect.

mnicky 2 months ago

Also, what is $20,000 today can be $2000 next year. Or $20...

See e.g. https://epoch.ai/data-insights/llm-inference-price-trends/

sumeno 2 months ago
Or $200,000 for consumers when they have to make a profit
- philipallstar 2 months ago
  
  Good point. This is why consumer phones have got much worse since 2005 and now cost millions of dollars.
  

ALittleLight 2 months ago

3 years ago the best model was DaVinci. It cost 3 cents per 1k tokens (in and out the same price). Today, GPT-5.4 Nano is much better than DaVinci was and it costs 0.02 cents in and 0.125 cents out per 1k tokens.

In other words, a significantly better model is also 1-2 orders of magnitude cheaper. You can cut it in half by doing batch. You could cut it another order of magnitude by running something like Gemma 4 on cloud hardware, or even more on local hardware.

If this trend continues another 3 years, what costs 20k today might cost $100.
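Spelling out that arithmetic as a sketch, using only the per-1k-token prices quoted in this comment (they are taken as given here, not independently verified). The function name and the assumption that the same price drop simply repeats once are illustrative.

```python
import math

# Per-1k-token prices in cents, as quoted in the comment above.
DAVINCI_CENTS = 3.0
NANO_IN_CENTS = 0.02
NANO_OUT_CENTS = 0.125

# How many times cheaper the newer model is, for input and output tokens.
in_ratio = DAVINCI_CENTS / NANO_IN_CENTS    # roughly 150x
out_ratio = DAVINCI_CENTS / NANO_OUT_CENTS  # 24x

# Express the same ratios as orders of magnitude.
in_orders = math.log10(in_ratio)    # a bit over 2
out_orders = math.log10(out_ratio)  # between 1 and 2


def projected_cost(cost_today: float, price_drop: float) -> float:
    """Cost of the same job if prices fall by `price_drop` times once more."""
    return cost_today / price_drop


# Assumption: the drop seen over the last 3 years repeats over the next 3.
optimistic = projected_cost(20_000, in_ratio)     # about $133
pessimistic = projected_cost(20_000, out_ratio)   # about $833
```

So on these numbers a $20k job lands somewhere between roughly $130 and $830 if the trend repeats, depending on the mix of input and output tokens. The $100 figure sits at the optimistic end of that range.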

ai_fry_ur_brain 2 months ago
5.4 nano isn't useful for a serious task. This is so hypothetical and optimistic it's annoying
- ALittleLight 2 months ago
  
  Think of it as paying for tokens. The tokens you could buy 3 years ago are better and two orders of magnitude cheaper today. If that happens again over the next 3 years then the tokens you can buy today to do a job for 20k will cost 200.
  This isn't optimistic in my opinion. It's not even fully realistic because Gemma 4, which you can run on local hardware, is even better and another few orders of magnitude cheaper. A 20k job today might cost a few dollars in a few years.

pseudohadamard 2 months ago

> I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher

But apart from enterprise customers, which seems to be their target audience, who employs those? Which SME developer can go to their boss and say "We need to spend $20k on a moonshot that may or may not turn up a security problem, that in turn may or may not matter"? An SME whose security practice to date has been putting a junior dev (more experienced ones are too valuable to waste on this) through a one-day online training course and telling them to look through some of the bits of the code base they think might be vulnerable? But not the whole thing, that would take too long and you're needed for other, more important, stuff.

The whole field is still just too immature at the moment: it takes lots and lots (and lots) of handholding to get useful results, and equally large amounts of money. Compare that to some of the SAST tools integrated into GitHub or similar, where you just get a report at some point saying "hey, we found something here, you may want to look at it, and our tracking system will handle the update/fix process for you".

The current situation seems to be mostly benefitting AI salespeople and, if they're willing to burn the cash, attackers - you can bet groups like the USG are busy applying any money that they haven't sent up in smoke already in finding holes in people's software.

integralid 2 months ago

> Or none

We already know this is not true, because small models found the same vulnerability.

tptacek 2 months ago
No, they didn't. They distinguished it, when presented with it. Wildly different problem.
- enraged_camel 2 months ago
  
  Yeah. And it is totally depressing that this article got voted to the top of the front page. It means people aren’t capable of this most basic reasoning so they jumped on the “aha! so the mythos announcement was just marketing!!”
  
BoiledCabbage 2 months ago
> because small models found the same vulnerability.
With a ton of extra support. Note this key passage:
>We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.
Yeah it can find a needle in a haystack without false positives, if you first find the needle yourself, tell it exactly where to look, explain all of the context around it, remove most of the hay and then ask it if there is a needle there.
It's good for them to continue showing ways that small models can play in this space, but in my read their post is fairly disingenuous in saying they are comparable to what Mythos did.
I mean this is the start of their prompt, followed by only 27 lines of the actual function:
> You are reviewing the following function from FreeBSD's kernel RPC subsystem (sys/rpc/rpcsec_gss/svc_rpcsec_gss.c). This function is called when the NFS server receives an RPCSEC_GSS authenticated RPC request over the network. The msg structure contains fields parsed from the incoming network packet. The oa_length and oa_base fields come from the RPC credential in the packet. MAX_AUTH_BYTES is defined as 400 elsewhere in the RPC layer.
The original function is 60 lines long. They ripped out half of the function in that prompt, including additional variables, presumably so that the small model wouldn't get confused / distracted by them.
You can't really do anything more to force the issue except maybe include in the prompt the type of vuln to look for!
It's great that they are trying to push small models, but this write-up really is just borderline fake. Maybe it would actually succeed, but we won't know from that. Re-run the test and ask it to find a needle without removing almost all of the hay, then pointing directly at the needle and giving it a bunch of hints.
The prompt they used: https://github.com/stanislavfort/mythos-jagged-frontier/blob...
Compare it to the actual function that's twice as long.
- apgwoz 2 months ago
  
  The benefit here is reducing the time to find vulnerabilities; faster than humans, right? So if you can rig a harness for each function in the system, by first finding where it’s used, its expected input, etc, and doing that for all functions, does it discover vulnerabilities faster than humans?
  Doesn’t matter that they isolated one thing. It matters that the context they provided was discoverable by the model.
  

SpicyLemonZest 2 months ago

What the source article claims is that small models are not uniformly worse at this, and in fact they might be better at certain classes of false positive exclusion. This is what Test 1 seems to show.

(I would emphasize that the article doesn't claim and I don't believe that this proves Mythos is "fake" or doesn't matter.)

sandeepkd 2 months ago

The security researcher charges a premium for all the effort they put into learning the domain. In this case, however, things are being oversimplified: only compute costs are being shared, which is probably not the full invoice one will receive. The training costs and investments need to be recovered, along with the salaries.

Machines being faster and more accurate is the differentiating factor once the context is well understood.

locknitpicker 2 months ago

> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

How is this preferable or even comparable with using COTS security scanners and static code analysis tools?

john_minsk 2 months ago

In the future there shouldn't be any bugs. I'm not paying $20 per month to get non-secure code base from AGI.

siva7 2 months ago

Except you would need about 10,000 security researchers in parallel to inspect the whole FreeBSD codebase. So about 200 million dollars at least.

amazingamazing 2 months ago

Citation needed for basically all of this. You basically are creating a double standard for small models vs mythos…

johnfn 2 months ago
The citation is the Anthropic writeup.
- amazingamazing 2 months ago
  
  They did not say what you are saying…
  > If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns.
  

youre-wrong3 2 months ago

[dead]