Comment by ComplexSystems
9 days ago
It makes too many mistakes and is just way too sloppy with math. It shouldn't be this hard to do pair theorem-proving with it. It can't tell the difference between a conjecture that sounds vaguely plausible and something that is actually true, and the entire point of math is to successfully distinguish between those two situations. It needs to carefully keep track of which of the claims it's making are currently proven, either in the current conversation or in the literature, versus which are merely conjectural and just sound nice. This doesn't seem inherently harder than the other tasks you folks have already solved, so I would hire a bunch of math grad students and go train this thing. It would be much better.
I think that's less a math thing and more a rigorous-treatment-of-anything thing. I find most LLMs are subject to this type of error: as the conversation context gets longer, they become dramatically stupider. Heck, try playing chess with one. Once it reaches the midgame it's forgetting which moves it just made, literally in the previous message, and hallucinating the context up to that point -- even when you provide it the position.
Curious to know how the different models compare for you when doing math. I've heard o4-mini is really good at math, but I haven't tried o3-pro much.
o3 is the best OpenAI model, but it still makes tons of mistakes. It has a very strong background in most undergrad-level math and a decent amount of grad-level machine learning, but its tendency to hallucinate means it will greedily fixate on some initial conjecture early on, not realize it's a conjecture, and continue to assert that it's true for the rest of the conversation. Similarly, if it thinks something is impossible, it will just assert that, and keep asserting again and again that it's impossible, even when it's actually true. It's the mathematical version of a hallucination. There's no real reason it should do this for grad-level topics - they just haven't trained it enough. It has survey-level knowledge of a TON of ideas, which can be great if you're looking for topics related to something, but as for the details of exactly how things are related, and what subtleties and caveats there are, it will just hallucinate its first guess and stay stuck there for the rest of the conversation.
o3-pro is maybe marginally better, but it takes a very long time to respond and so I rarely use it.
4o is much worse and so I usually use o3.
Gemini 2.5 Pro is much better - and free. Grok 4 is also probably up there with Gemini 2.5. They're just less prone to this kind of hallucination in general: they spend more time reasoning, checking claims, searching for prior literature, etc. They still mess up, but not quite as much as o3. I don't use Sonnet or Opus for math all that much - my impression was that o3 was better than Sonnet 3.7, but I'm not sure about 4.
I asked o4-mini how many prime numbers leave a remainder of 6 when divided by 35. It confidently stated that there are 'none'. It hadn't even tried hard enough to get to 41.
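For reference: since gcd(6, 35) = 1, Dirichlet's theorem guarantees there are infinitely many primes of the form 35k + 6. A minimal Python sketch (plain trial division, nothing fancy) that lists the small ones:

```python
def is_prime(n):
    """Trial-division primality test; fine for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Primes p with p % 35 == 6, i.e. p = 35k + 6, below 500.
hits = [p for p in range(6, 500, 35) if is_prime(p)]
print(hits)  # -> [41, 181, 251, 461]
```

41 = 35 + 6 is already the first counterexample to the model's "none".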
Yes, I've experienced this especially with spreadsheets. I work in marketing, and I've tried to use ChatGPT to analyze and summarize large spreadsheets. Sadly, I've learned it can't be trusted to do that.