Comment by lebovic

3 days ago

I think the third chart is the most notable: Mythos is the first model to saturate that eval from the UK AISI [1].

Personally, I think we crossed the threshold of meaningfully useful capabilities for autonomous hacking with Opus 4.6 [2], mostly because its behaviors and persistence are useful for finding vulnerabilities out of the box [3]. But it still seems like Mythos is another step up.