Comment by Closi

6 months ago

> Suppose it can't. How will you know?

By finding problems that o3 can't do and testing them on Deep Think, and also testing the reverse? Or by running large benchmarks that compare the models across a whole suite of questions with known answers.

Problems that both get correct will be easy to find and don't say much about comparative performance. That's why some of the benchmarks listed in the article (e.g. Humanity's Last Exam / AIME 2025) are potentially more insightful than one person's report on testing one question (which they don't provide) where both models replied with the same answer.
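
A minimal sketch of that comparison, assuming you already have per-question pass/fail results for each model. The function name `compare_models`, the question ids, and the results are all made up for illustration; they are not real benchmark data for o3 or Deep Think. The point is that only the questions where the models disagree tell you anything about which one is stronger.

```python
from collections import Counter


def compare_models(results_a, results_b):
    """Tally per-question outcomes for two models.

    results_a / results_b: dicts mapping question id -> True/False
    (True means the model answered correctly).
    Only questions present in both dicts are compared.
    """
    tally = Counter()
    for qid in results_a.keys() & results_b.keys():
        a_ok, b_ok = results_a[qid], results_b[qid]
        if a_ok and b_ok:
            tally["both_correct"] += 1   # uninformative for ranking
        elif a_ok:
            tally["only_a"] += 1         # evidence in favour of model A
        elif b_ok:
            tally["only_b"] += 1         # evidence in favour of model B
        else:
            tally["both_wrong"] += 1     # also uninformative for ranking
    return tally


# Hypothetical results, for illustration only.
model_a = {"q1": True, "q2": True, "q3": False, "q4": True, "q5": False}
model_b = {"q1": True, "q2": False, "q3": False, "q4": True, "q5": True}

tally = compare_models(model_a, model_b)
for outcome in ("both_correct", "only_a", "only_b", "both_wrong"):
    print(outcome, tally[outcome])
```

With one question where both models agree, `only_a` and `only_b` stay at zero, so a single matching answer can't separate the models; a larger suite is what fills in those two counts.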
