Comment by OsrsNeedsf2P

2 days ago

So it's trained on the SWE Bench Pro evalset

7 comments

topsycatt 2 days ago

That's not accurate. Take a look at the paper to see what it is trained on. Decontamination is specifically called out in A.4.

https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...

lemonish97 2 days ago

What is your evidence for this claim?

fooker 2 days ago

They say hill climbing:

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.

- artemisart 2 days ago
  
  Hill climbing doesn't mean much, but it absolutely doesn't imply they cheat on benchmarks. They have more details here: https://microsoft.ai/news/introducing-mai-thinking-1/ (it seems to be "RL on everything").
  