Comment by Closi
20 hours ago
It's not particularly interesting if Deep Think comes to the same (correct) conclusion on a single problem as o3 but costs more. You could ask GPT-3.5 and GPT-4 what 1+1 equals and get the same response, with GPT-4 costing more, but that doesn't tell us much about model capability or value.
It would be more interesting to know if it can handle problems that o3 can't do, or if it is 'correct' more often than o3 pro on these sorts of problems.
i.e. if o3 is correct 90% of the time, but Deep Think is correct 91% of the time on challenging organisational problems, it could be worth paying $250 for that extra percentage point of certainty (assuming the problem is high-value / high-risk enough).
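For what it's worth, here's a rough back-of-the-envelope Python sketch of that trade-off. Every number in it (questions per month, cost of a wrong answer, the two accuracy figures) is an illustrative assumption, not a measurement of either model:

```python
# Back-of-the-envelope: when does +1 percentage point of accuracy justify $250/month?
# All numbers below are illustrative assumptions, not measurements.

subscription_cost = 250.0       # assumed monthly cost of the pricier model ($)
decisions_per_month = 20        # assumed number of high-stakes questions asked per month
cost_of_wrong_answer = 5_000.0  # assumed downside of acting on a wrong answer ($)

accuracy_cheap = 0.90           # hypothetical o3 accuracy on this class of problem
accuracy_expensive = 0.91       # hypothetical Deep Think accuracy

# Expected loss from wrong answers under each model
expected_loss_cheap = decisions_per_month * (1 - accuracy_cheap) * cost_of_wrong_answer
expected_loss_expensive = decisions_per_month * (1 - accuracy_expensive) * cost_of_wrong_answer

savings = expected_loss_cheap - expected_loss_expensive  # value of the extra accuracy
print(f"Expected savings from the better model: ${savings:,.0f}/month")
print(f"Worth the subscription? {savings > subscription_cost}")

# With these numbers: 20 * 0.01 * 5000 = $1,000/month saved vs. a $250/month cost,
# so the extra percentage point pays for itself; with cheaper mistakes it wouldn't.
```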
> It would be more interesting to know if it can handle problems that o3 can't do
Suppose it can't. How will you know? All the datapoints will be "not particularly interesting".
> Suppose it can't. How will you know?
By finding problems that o3 can't do and testing them on Deep Think, and also testing the reverse? Or by running large benchmarks that compare both models across a whole suite of questions with known answers.
Problems that both models get correct will be easy to find and don't say much about comparative performance. That's why some of the benchmarks listed in the article (e.g. Humanity's Last Exam / AIME 2025) are potentially more insightful than one person's report of testing a single question (which they don't share), where both models gave the same answer.
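To make that concrete, here's a toy Python sketch of a head-to-head comparison over a shared question set. The per-question results are fabricated placeholders (they are not real benchmark scores); the point is that the "both right" and "both wrong" cells carry no comparative signal, while the discordant cells do:

```python
# Toy sketch of a head-to-head comparison over a suite of questions with known answers.
# The per-question results below are fabricated placeholders; a real run would come
# from scoring each model's outputs against a benchmark like the ones in the article.

# 1 = model answered correctly, 0 = answered incorrectly, same ordered question set
o3_correct         = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
deep_think_correct = [1, 1, 1, 1, 1, 0, 1, 0, 1, 1]

n = len(o3_correct)
acc_o3 = sum(o3_correct) / n
acc_dt = sum(deep_think_correct) / n

# 2x2 breakdown: the "both right" and "both wrong" cells say nothing about which
# model is better; only the discordant cells (one right, one wrong) do.
both_right = sum(a and b for a, b in zip(o3_correct, deep_think_correct))
both_wrong = sum(not a and not b for a, b in zip(o3_correct, deep_think_correct))
only_o3    = sum(a and not b for a, b in zip(o3_correct, deep_think_correct))
only_dt    = sum(b and not a for a, b in zip(o3_correct, deep_think_correct))

print(f"o3 accuracy:         {acc_o3:.0%}")
print(f"Deep Think accuracy: {acc_dt:.0%}")
print(f"Both right: {both_right}, both wrong: {both_wrong}, "
      f"only o3 right: {only_o3}, only Deep Think right: {only_dt}")
```

With real benchmark data you'd also want a significance test on those discordant counts (e.g. McNemar's test) before concluding that one model is actually better rather than just lucky on that suite.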