← Back to context

Comment by data_maan

16 days ago

> What do you mean ? These are top-notch mathematicians

Yes. I didn't dispute that. I disputed that they are top-notch ML specialists: they have made one of the worst benchmarks of 2025-2026. Benchmarks like these would have worked in early 2024 at the latest. The field has moved on significantly since.

And yes, many, many other benchmarks don't use toy problems -- their names are just a prompt away.

> You are kidding right ? FrontierMath benchmark [1] is produced by a startup whose incentives are dubious to say the least.

They did 1) open-source some of their data points (on a similar order of magnitude) and 2) carry out detailed evals. There is much to learn from their blog posts, much more than from the current dataset.

But fair enough. If you don't like them, have a look at IMProofBench. Have a look at the AIMO competition. Have a look at HardMath. It's quite a landscape of datasets already.

> Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools

As mentioned above, realistic benchmarks that are bigger and better already exist. Unfortunately, from a benchmarking POV, these mathematicians are the hypesters, with a preprint that wouldn't even make it into the AI&Math workshops at ICML or NeurIPS.