Comment by ComplexSystems

8 days ago

o3 is the best OpenAI model, but it still makes tons of mistakes. It has a very strong background in most undergrad-level math and a decent amount of grad-level machine learning, but it tends to fixate greedily on some initial conjecture early on, not realize it's a conjecture, and keep asserting that it's true for the rest of the conversation. Similarly, if it thinks something is impossible, it will assert that again and again, even when it's actually possible. It's like the mathematical version of a hallucination. There is no real reason it should do this for grad-level topics; they just haven't trained it enough. It has survey-level knowledge of a TON of ideas, which can be great if you are looking for topics related to something, but when it comes to the details of exactly how things are related, what subtleties and caveats there are, and so on, it will just hallucinate its first guess and get stuck there for the rest of the conversation.

o3-pro is maybe marginally better, but it takes a very long time to respond and so I rarely use it.

4o is much worse and so I usually use o3.

Gemini 2.5 Pro is much better - and free. Grok 4 is also probably up there with Gemini 2.5. They just have less of a tendency to hallucinate in this way: they will spend more time reasoning, checking claims, searching for prior literature, etc. They still mess up, but not quite as much as o3. I don't use Sonnet or Opus for math all that much - my impression was that o3 was better than Sonnet 3.7, but I'm not sure about 4.