Comment by data_maan
17 days ago
Very serious for mathematicians - not for ML researchers.
If the paper hadn't had the AI spin, would those 10 questions still have been interesting?
It seems to me that we have here a paper that is interesting solely because of the AI spin -- while at the same time that AI spin is poorly executed from the point of view of AI research, where this would be a blog post at most, not an arXiv preprint.
I’m confused by this comment. I’m pretty sure that someone at all the big labs is running these questions through their models and will report back as soon as the results arrive (if not sooner, assuming they can somehow verify the answers).
The fact that you find it odd that this landed on arXiv is maybe a cultural thing… mathematicians kinda reflexively throw work up there that they think should be taken seriously. I doubt that they intend to publish it in a peer reviewed journal.
Yes, but people at those labs may be running those problems because a Fields Medalist is in the paper, and it got hype.
Not because of the problems, and not because this is new methodology.
And once the labs report back, what do we know that we didn't know before? We already know, as humans, the answers to the problems, so that is not it. We already know that LLMs can solve some hard problems and fail at easy ones, so that is not it either.
So what do we really learn?
Ah. I think the issue is that research mathematicians haven’t yet hit the point where the big models are helping them on the problems they care about.
Right now I can have Claude Code write a single-purpose app in a couple of hours, complete with a nice front end, auth, db, etc. (with a little babysitting). The models solve a lot of the annoying little issues that an experienced software developer has had to solve to get an MVP out.
These problems are representative of the types of subproblems research mathematicians have to solve to get a “research result”. They are finding that LLMs aren’t that useful for mathematical research because they can’t crush these problems along the way. And I assume they put this doc together because they want that to change :)
> So what do we really learn?
We will learn whether the magical capabilities attributed to these tools are real or not -- capabilities like being able to solve any math problem out there. This is important because AI hype is creating the narrative that these tools can solve PhD-level problems, and this will dispel that narrative. In my book, any test that refutes and dispels false narratives makes a huge contribution.
The last "unsolved Erdős problem proved by an LLM" that hit the news was so uninteresting that a paper published by Erdős himself had already stated the proof...
aaaaaaand no one cared enough to check.
So I think the question is: are these problems interesting in themselves, or are they uninteresting problems no one will ever care about, except that solving them would indicate LLMs are good at solving complex novel problems that don't exist in their training set?
The timed-reveal aspect is also interesting.
How is that interesting from a scientific point of view? This seems more like a social experiment dressed up as science.
Science should be about reproducibility, and almost nothing here is reproducible.
> Science should be about reproducibility, and almost nothing here is reproducible.
I can see your frustration. You are looking for reproducible "benchmarks". But you have to realize several things.
1) Research-level problems are those that bring the "unknown" into the "known", and as such they are not reproducible. That is why "creativity" has no formula: there are no prescribed processes or rules for "reproducing" creative work. If there were, it would not be considered "research".
2) Things learnt and trained on are already in the realm of the "known", i.e., boilerplate, templated, and reproducible.
The problems in 2) are where LLMs excel, but they have been hyped into excelling at 1) as well. This experiment is trying to test that hypothesis.
DeepMind’s Nobel Prize was primarily for its performance in CASP, which is pretty much exactly this: labs solve the structures of proteins but don’t publish them until after all the computational teams have submitted their predicted structures.
So I’m not sure where you’re coming from in claiming that this isn’t scientific.
Reproducibility is just one aspect of science; logic and reasoning from principles and data is the major aspect.
There are some experiments which cannot be carried out more than once.