Comment by tombert
12 days ago
I remember when ChatGPT first came out, I asked it for a proof for Fermat's Last Theorem, which it happily gave me.
It was fascinating, because it was making a lot of understandable mistakes that 7th graders make. For example, I don't remember the surrounding context, but it decided that you could break `sqrt(x^2 + y^2)` into `sqrt(x^2) + sqrt(y^2) => x + y`. It's interesting because it was one of those "ASSUME FALSE" proofs; if you can assume false, then mathematical proofs become considerably easier.
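(For anyone who wants to see why that step fails: a single numeric counterexample is enough. This is just a quick sanity check, nothing to do with how ChatGPT itself works.)

```python
import math

# The bogus step: sqrt(x^2 + y^2) "=" sqrt(x^2) + sqrt(y^2) "=" x + y.
# The classic 3-4-5 triangle refutes it:
x, y = 3.0, 4.0
lhs = math.sqrt(x**2 + y**2)  # sqrt(25) = 5.0
rhs = x + y                   # 7.0
print(lhs, rhs)               # 5.0 7.0 -- not equal
```

The identity only holds when one of the terms is zero; sqrt distributes over multiplication, not addition.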
My favorite early ChatGPT math problem was "prove there exist infinitely many even primes". Easy! Take a finite set of even primes, multiply them and add one to get a number with a new even prime factor.
Of course, it's gotten a bit better than this.
IIRC, that is actually the standard proof that there are infinitely many primes[1] or maybe this variation on it[2].
[1]: https://en.wikipedia.org/wiki/Euclid%27s_theorem#Euclid's_pr...
[2]: https://en.wikipedia.org/wiki/Euclid%27s_theorem#Proof_using...
Yes, this is the standard proof that there are infinitely many primes, but note that my prompt asked for infinitely many *even* primes. The point is that GPT would take the correct proof and insert "even" at sensible places to get something that looks like a proof but is totally wrong.
Of course it's much better now, but with more pressure to prove something hard the models still just insert nonsense steps.
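(Euclid's construction is easy to play with in code, and doing so shows exactly why the "even" insertion breaks it. A minimal sketch; trial division is my choice here, not anything from the thread.)

```python
def new_prime_factor(primes):
    """Euclid's construction: given a finite list of primes, the number
    product(primes) + 1 is coprime to every prime in the list, so its
    smallest prime factor is a *new* prime."""
    n = 1
    for p in primes:
        n *= p
    n += 1
    # Smallest prime factor of n by trial division.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n itself is prime

print(new_prime_factor([2, 3, 5]))  # 31 -- new, but odd
```

Notice that if 2 is in your list, product + 1 is odd, so the new factor is guaranteed to be odd — the construction can never manufacture a second even prime, which is why inserting "even" into the proof is nonsense.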
LLMs have improved so much that the original ChatGPT isn't really relevant anymore.
I remember that being true of early ChatGPT, but it's certainly not true anymore; GPT-4o and GPT-5 have tagged along with me through all of MathAcademy MFII, MFIII, and MFML (roughly undergrad Calc 2, about half a stats class, and two-thirds of a linear algebra class), and I can't remember either getting anything wrong.
Presumably this is all a consequence of better tool call training and better math tool calls behind the scenes, but: they're really good at math stuff now, including checking my proofs (of course, the proof stuff I've had to do is extremely boring and nothing resembling actual science; I'm just saying, they don't make 7th-grader mistakes anymore.)
It's definitely gotten considerably better, though I still have issues with it generating proofs, at least with TLAPS.
I think behind the scenes it's phoning Wolfram Alpha nowadays for a lot of the numeric and algebraic stuff. For all I know, they might even have an Isabelle instance running for some of the even-more abstract mathematics.
I agree that this is largely an early ChatGPT problem, though; I just thought it was interesting that they were "plausible" mistakes. I could totally see twelve-year-old tombert making these exact mistakes, so it was interesting that a robot was making the same mistakes an amateur human makes.
Maybe, but they swear they didn't use external tools on the IMO problem set.
I assumed it was just writing SymPy or something.
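(Purely speculative, but the kind of SymPy snippet a code-interpreter tool call might emit to check algebraic claims looks something like this — this is an illustration of the idea, not OpenAI's actual pipeline.)

```python
import sympy as sp

x = sp.symbols("x")

# Verify a derivative symbolically instead of "reasoning" about it:
deriv = sp.diff(sp.sin(x) * sp.exp(x), x)
print(sp.simplify(deriv - sp.exp(x) * (sp.sin(x) + sp.cos(x))))  # 0

# Evaluate the Gaussian integral exactly:
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))  # sqrt(pi)
```

Offloading the symbolic manipulation this way sidesteps exactly the "insert a plausible-looking step" failure mode discussed above.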