Comment by gruez

1 day ago

>Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely that their solutions are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, e.g. in the context of backwards compatibility.

[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...