Comment by XCSme
1 month ago
I got similar results for most models, with gemini 3 flash (with reasoning) being the most consistent/reliable model: https://aibenchy.com
I also noticed the same thing: some models reason correctly but draw the wrong conclusions.
And MiniMax m2.5 just reasons forever (filling the entire reasoning context) and gives wrong answers. This is why it's #1 on OpenRouter, it burns through tokens.
No comments yet
Contribute on Hacker News ↗