Comment by data_maan

18 days ago

Yes, but people at those labs may be running those problems because a Fields Medalist is in the paper, and it got hype.

Not because of the problems, and not because this is new methodology.

And once the labs report back, what do we know that we didn't know before? We already know the answers to the problems, as humans, so that is not it. We already know that LLMs can solve some hard problems and fail on easy ones, so that is not it either.

So what do we really learn?

Ah. I think the issue is that research mathematicians haven’t yet hit the point where the big models are helping them on the problems they care about.

Right now I can have Claude Code write a single-purpose app in a couple of hours, complete with a nice front end, auth, db, etc. (with a little babysitting). The models solve a lot of the annoying little issues that an experienced software developer has had to solve to get out an MVP.

These problems are representative of the types of subproblems research mathematicians have to solve to get a “research result”. They are finding that LLMs aren’t that useful for mathematical research because they can’t crush these problems along the way. And I assume they put this doc together because they want that to change :)

  • > These problems are representative of the types of subproblems research mathematicians have to solve to get a “research result”. They are finding that LLMs aren’t that useful for mathematical research because they can’t crush these problems along the way. And I assume they put this doc together because they want that to change :)

    Same holds true for IMProofBench problems. This dataset shows nothing new.

> So what do we really learn?

We will learn whether the magical capabilities attributed to these tools are real or not. Capabilities like being able to magically solve any math problem out there. This is important because AI hype is creating the narrative that these tools can solve PhD-level problems, and this benchmark will dispel that narrative. In my book, any test that refutes and dispels false narratives makes a huge contribution.

  • > We will learn if the magical capabilities attributed to these tools are really true or not.

    They're not. We already know that: FrontierMath, Yu Tsumura's 553rd problem, the RealMath benchmark. The list goes on. As I said many times in this thread, there is nothing novel in this benchmark.

    The fact that this benchmark is so hyped shows that the community knows nothing, NOTHING, about prior work in this space, which makes me sad.