Comment by lucianbr

21 hours ago

> It would be more interesting to know if it can handle problems that o3 can't do

Suppose it can't. How will you know? All the datapoints will be "not particularly interesting".

> Suppose it can't. How will you know?

By finding problems that o3 can't do and testing them on Deep Think, and also testing the reverse? Or by running large benchmarks that compare both models across a whole suite of questions with known answers.

Problems that both models get correct will be easy to find and say little about comparative performance. That's why some of the benchmarks listed in the article (e.g. Humanity's Last Exam, AIME 2025) are potentially more insightful than one person's report of testing a single question (which they don't provide) where both models gave the same answer.