Comment by epistasis
18 hours ago
I was a bit confused by your definitions, but here's how Mozilla broke out [1] the 271, um, things:
> As additional context, we apply security severity ratings from critical to low to indicate the urgency of a bug:
> * sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior, like browsing to a web page. We make no technical difference between these, but sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> * sec-moderate is assigned to vulnerabilities that would otherwise be rated sec-high but require unusual and complex steps from the victim.
> * sec-low is assigned to bugs that are annoying but far from causing user harm (e.g., a safe crash).
> Of the 271 bugs we announced for Firefox 150: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.
Mozilla uses the term "vulnerability" even for sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit. And on their definitional page, they classify even sec-low as "vulnerabilities" [2].
Words are tools that get their utility from collective meaning. I'd be interested in where you received your semantics from and whether they match up with or diverge from Mozilla's.
[1] https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
[2] https://wiki.mozilla.org/Security_Severity_Ratings/Client
I work at Mozilla; I fixed a bunch of these bugs.
In general, I would say that our use of "vulnerability" lines up with what jerrythegerbil calls "potential vulnerability". (In cases with a POC, we would likely use the word "exploit".) Our goal is to keep Firefox secure. Once it's clear that a particular bug might be exploitable, it's usually not worth a lot of engineering effort to investigate further; we just fix it. We spend a little while eyeballing things for the purpose of sorting into sec-high, sec-moderate, etc., and to help triage incoming bugs, but if there's any real question, we assume the worst and move on.
So were all 271 bugs exploitable? Absolutely not. But they were all security bugs according to the normal standards that we've been applying for years.
(Partial exception: there were some bugs that might normally have been opened up, but were kept hidden because Mythos wasn't public information yet. But those bugs would have been marked sec-other, and not included in the count.)
So if you think we're guilty of inflating the number of "real" vulnerabilities found by Mythos, bear in mind that we've also been consistently inflating the baseline. The spike in the Firefox Security Fixes by Month graph is very, very real: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
You may not be able to comment, but do you feel like Mythos is accomplishing anything that couldn't have already been done with Opus and the right prompting?
I've assumed I could send an agent using a publicly available model bug hunting in a codebase like this and get tons of results, assuming I wanted to burn the tokens, so it's really unclear to me whether the Mythos hype is justified or if it's just an easy button (and subsidized tokens?) to do what is already possible.
I never got direct access to Mythos, so all I know is what I've seen from the quality of the bugs being produced. I also haven't been involved at the prompting end.
So the best answer I can give is: I dunno, maybe it's possible to find bugs like this using Opus, but if so, where are they? Did nobody think to try "please find the bug in this code" pre-Mythos? I've done enough auditing with Opus to be convinced that it can be a good assistant to somebody who already knows what they're doing, but in practice the big wave of AI-discovered bugs started with Mythos.
I'm sure lots of people have assumed they could send a publicly available model bug hunting and find things. I have not noticed a huge amount of success. We've had some very nice correctness bugs reported, but skimming through the list of security bugs I've fixed recently, the AI-related ones all seem to be Mythos.
My best guess is that Mythos is just enough better along just enough axes that its hit rate on finding potential bugs and separating the real ones from the hallucinations is good enough to matter. Like, there's no obvious qualitative difference between 3.6 kg of uranium-235 and 3.8 kg of uranium-235, just a small quantitative increase. But if you form both of them into spheres, only one of them has reached critical mass. Can you do something clever to reach critical mass with 3.6 kg of uranium? Maybe! But needing to do something clever is a non-trivial barrier in itself.
What types of vulnerabilities was it finding? Cross-site scripting, privilege escalation, etc.? Mostly memory corruption, or any JavaScript logic bugs?
I work on SpiderMonkey, so I mostly looked at the JS bugs. It was a smorgasbord of various things. Broadly speaking I'd say the most impressive bugs were TOCTOU issues, where we checked something and later acted on it, and the testcase found a clever way to invalidate the result of the check in between.
If you look closely at, say, this patch, you might get a sense of what I mean (although the real cleverness is in the testcase, which we have not made public): https://hg-edge.mozilla.org/integration/autoland/rev/c29515d...
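For readers who don't live in engine code, here's a hypothetical JS sketch of the general TOCTOU shape. This is not the actual testcase (which is private); the specifics are invented for illustration:

    // Hypothetical sketch of the TOCTOU shape, not the real bug.
    // An engine builtin reads arr.length (the "check"), then coerces
    // an argument, which runs arbitrary user JS before the "use".
    let arr = new Array(100);
    arr.fill(0, {
      valueOf() {
        arr.length = 0;  // invalidate the length the engine already read
        return 0;
      }
    });
    // In a correct engine this is harmless: writes past the new length
    // just regrow the array. The historical bug class is an optimized
    // path that caches a raw pointer to the old backing store and keeps
    // writing through it after the callback has shrunk the array.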
6 replies →
I'd say it leans towards memory corruption kinds of issues, as those are easiest to pass the validator, thanks to AddressSanitizer. I think there's a lot of potential for making the validator more sophisticated. Like maybe you add a JS function that will only crash when run in the parent process and have a validator that checks for that specific crash, as a way for the LLM to "prove" that it managed to run arbitrary JS in the parent. Would that turn up subtler issues? Maybe.
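To make the ASan-gated part concrete, here's a minimal sketch of what such a validator might look like. FIREFOX_ASAN (an env var pointing at an ASan build) and testcase.js are assumptions for illustration, not how the real pipeline works:

    // Minimal sketch of an ASan-gated validator (assumed setup, not
    // the real pipeline): run the candidate reproducer under an
    // AddressSanitizer build and look for ASan's error banner.
    const { spawnSync } = require("child_process");

    function validates(testcase) {
      const r = spawnSync(process.env.FIREFOX_ASAN, [testcase], {
        encoding: "utf8",
        timeout: 60_000,
      });
      // ASan prints a recognizable banner on memory errors; a clean
      // exit or an ordinary JS exception doesn't count as a finding.
      return /ERROR: AddressSanitizer/.test(r.stderr || "");
    }

    console.log(validates("testcase.js"));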
I'm not a security dev or researcher or anything, but as an outsider my understanding matches how Mozilla uses the terms. Though words used by specialists and the general public can often differ...
How about this: a "vulnerability" is a "vulnerability", but after it has been identified and verified to cause problems, that's when it should be called a "bug", because it could make the software do unwanted things.
At Mozilla, everything is called a bug. It's what other systems call an "issue". So it's too late for your terminology at Mozilla. (Example: I have a bug to improve the HTML output of my static analysis tool. There is nothing incorrect or flawed about the current output.)
At Mozilla, but not everywhere: exploits are a subset of vulnerabilities are a subset of bugs.
1 reply →
When I worked at Mozilla, _everything_ was called a bug, whether it was a software issue, a problem in the office or some paperwork missing.
Much as GitHub calls everything an "issue" and GitLab a "work item".
Can you elaborate on why those bugs weren't found by e.g. fuzzing in the past?
I'm genuinely curious what "types" of implementation mistakes these were, like whether it was library usage bugs, state management bugs, control flow bugs, etc.
Would love to see a writeup about these findings; maybe Mythos is hinting that better fuzzing tools are needed?
If I had to guess, I'd say that AI is better at finding TOCTOU bugs than fuzzing because it starts by looking at the code and trying to find problems with it, which naturally leads it to experiment with questions like "is there any way to make this assumption false?", whereas fuzzing is more brute force. Fuzzing can explore way more possible states, but AI is better at picking good ones.
In this particular sense, AI tends to find bugs that are closer to what we'd see from a human researcher reading the code. Fuzz bugs are often more "here's a seemingly innocuous sequence of statements that randomly happen to collide three corner cases in an unexpected way".
Outside of SpiderMonkey, my understanding is that many of the best vulnerabilities were in code that is difficult to fuzz effectively for whatever reason.
Fuzzing isn't good at things like dealing with code behind a CRC check, whereas the audit-based approach using an LLM can see the sketchy code, then calculate the CRC itself to come up with a test case. I think you end up having to write custom fuzzing harnesses to get at the vulnerable parts of the code. (This is an example from a talk by somebody at Anthropic.)
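A sketch of what that roadblock looks like (crc32 and vulnerableParse are hypothetical stand-ins, and the file layout is invented):

    // Why a checksum gate stalls a mutation fuzzer. crc32 and
    // vulnerableParse are hypothetical helpers for illustration.
    function parse(buf) {
      const stored = buf.readUInt32LE(0);   // checksum stored in the header
      const body = buf.subarray(4);
      if (crc32(body) !== stored) return;   // nearly all random mutations die here
      vulnerableParse(body);                // the buggy code lives behind the gate
    }
    // A byte-flipping fuzzer breaks the checksum on almost every mutation,
    // so vulnerableParse is essentially unreachable without a custom
    // harness. An auditing LLM can read parse(), see the gate, and simply
    // recompute crc32(body) for whatever malicious body it wants to try.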
That being said, I think there's a lot of potential for synergy here: if LLMs make writing code easier, that includes fuzzers, so maybe fuzzers will also end up finding a lot more bugs. I saw somebody on Twitter say they used an LLM to write a fuzzer for Chrome and found a number of security bugs that they reported.
Presumably there are (implicit?) "sec-none" things, like [a] from the recently released 150.0.2 [b], whose bug report makes no mention of "Security Impact" or "Severity", unlike [c], which is listed in the Mozilla weblog post [2].
Security fixes are mentioned in the Release Notes [b], which point to a completely different document [d].
Perhaps sometimes a bug is 'just' a bug, and not a vulnerability.
[a] https://bugzilla.mozilla.org/show_bug.cgi?id=2034980 ; "Can't highlight image scans in Firefox 150+"
[b] https://www.firefox.com/en-CA/firefox/150.0.2/releasenotes/
[c] https://bugzilla.mozilla.org/show_bug.cgi?id=2024918
[d] https://www.mozilla.org/en-US/security/advisories/mfsa2026-4...
> Mozilla uses the term "vulnerability" even for sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit.
That’s not evident in what you pasted at all.
What you pasted says
> sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior […] We make no technical difference between these […] sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> sec-low is assigned to bugs that are annoying but far from causing user harm (e.g., a safe crash).
From this one infers that the 180 sec-high bugs are actually exploitable (though not known to be exploited in the wild), and are NOT mere annoying bugs.
The difference between 180 and 271 does nothing to deflate the significance, or lack thereof, of the implication re: Mythos.
Yes, it is not in what I pasted; as I said, "they say right below". If you don't believe me, then click on either of the links.