jansan 2 hours ago
So you believe one marketing department more than the other?

NitpickLawyer 2 hours ago
The Brits have a step-based benchmark that they use for this: https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...
They seem pretty close, in both average and "best run" scores. And in a highly verifiable domain, "best run" or pass@n is what you're looking for.

aesthesia 1 hour ago
Worth looking at the follow-up post that evaluates the current version of Mythos, which solves one of the main tasks that GPT-5.5-Cyber does not. https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber...