Comment by lebovic
3 days ago
I think the third chart is the most notable: Mythos is the first model to saturate that eval from the UK AISI [1].
Personally, I think we crossed the threshold of meaningfully useful capabilities for autonomous hacking with Opus 4.6 [2], mostly because its behaviors and persistence are useful for finding vulnerabilities out of the box [3]. But it still seems like Mythos is another step up.