OsrsNeedsf2P 2 days ago
So it's trained on the SWE Bench Pro evalset

  topsycatt 2 days ago
  That's not accurate. Take a look at the paper to see what it is trained on! And specifically, decontamination is called out in A.4: https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...

  lemonish97 2 days ago
  What is your evidence for this claim?

    fooker 2 days ago
    They say hill climbing: https://microsoft.ai/news/building-a-hillclimbing-machine-la...
    Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.

      artemisart 2 days ago
      Hill climbing doesn't mean much, but it absolutely doesn't imply they cheat on benchmarks. They have more details here: https://microsoft.ai/news/introducing-mai-thinking-1/ It seems to be "RL on everything".

      jongalloway2 2 days ago
      [dead]

      ajyoon 2 days ago
      [flagged]