Note that these are two different claims:
The OP claims the publicly available models all failed to get Bronze.
OpenAI's tweet claims there is an unreleased model that can get Gold.
I'd also be highly wary of the method they used because of statements like this:
>we note that the vast majority of its answers simply stated the final answer without additional justification
While the reasoning steps are obviously important for judging human participant answers, none of the current big-name providers disclose their actual reasoning tokens. So unless they got direct internal access to these models from the big companies (which seems highly unlikely), this might be yet another poorly designed study (of which we have seen several in recent months, even by serious parties).
My (unreleased) cat did even better than the OpenAI model. No you cannot see. Yes you have to trust me. Now gimme more money.
Wow, that’s incredible. Cats are progressing so fast; the unreleased cats especially seem to be doing much better. My two orange kitties aren’t doing well on math problems, but obviously that’s because I’m not prompting the right way – any day now. If I ever get it to work, I’ll be sure to share the achievements on X, while carefully avoiding explaining how I did it or providing any data that could corroborate the claims.
I don't know the details (of course, it's unreleased), but note that MathArena evaluated the "average of 4 attempts" and limited token usage to 64k.
OpenAI likely had unlimited tokens and evaluated the "best of N attempts."
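To make the gap between the two protocols concrete, here's a minimal sketch (the scoring helpers are hypothetical, not MathArena's or OpenAI's actual code):

```python
# Sketch: average-of-k vs. best-of-N scoring over the same attempts.
# The attempt scores are made up; a real evaluation would grade model outputs.

def avg_of_k(scores_per_attempt):
    """MathArena-style: average the scores of all attempts."""
    return sum(scores_per_attempt) / len(scores_per_attempt)

def best_of_n(scores_per_attempt):
    """Best-of-N-style: report only the highest-scoring attempt."""
    return max(scores_per_attempt)

# Example: one problem scored out of 7 points, four attempts.
attempts = [0, 2, 7, 1]
print(avg_of_k(attempts))   # 2.5 -- what an average-of-4 protocol reports
print(best_of_n(attempts))  # 7   -- what a best-of-N protocol reports
```

Same underlying attempts, very different headline numbers, which is why the two results aren't directly comparable.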
That's a claim that is far less plausible. OpenAI could have thrown more resources at the problem and I would be surprised if that didn't improve the results.
The model did not fit in the margin.
We'll never know how many GPUs and how much other assistance (like custom code paths) this model got.
Exactly. Whom to believe?
The last time someone claimed a medal in an olympiad like this, it turned out they had manually translated the problem into Lean and then run a brute-force search algorithm to find a proof. For 60 hours. On a supercomputer.
Meanwhile, high schoolers get a piece of paper and 4.5 hours.
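For anyone unfamiliar with what "translating into Lean" means, here is a toy illustration (not an actual olympiad problem): the human writes the formal statement, and the machine only has to close the proof.

```lean
-- Toy illustration of a hand-translated statement in Lean 4 (not an
-- actual olympiad problem). A human formalizes the claim; a proof term
-- or a search procedure then has to close it.
theorem toy_statement (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The hard, human part is the translation itself; the search only starts once the statement is formalized.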
Even though chess engines now reliably beat human players, I still remember Kasparov's suspicion that one of Deep Blue's moves had a human touch. It was never proven or disproven, but I trust Kasparov's deep intuition, reinforced by the fact that Kasparov requested access to Deep Blue's logs and IBM refused to share them in full. For more discussion, see [1][2][3].
[1] https://chess.stackexchange.com/questions/9959/did-deep-blue...
[2] https://nautil.us/why-the-chess-computer-deep-blue-played-li...
[3] https://en.chessbase.com/post/deep-blue-s-cheating-move
Kinda wild that an LLM can't translate to Lean?
Both are true. One spent $400 in compute and the other one spent a lot more.
Exactly. And it presumably had a more sophisticated harness around the model: longer reasoning chains, best-of-N sampling, self-judging, etc.
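As a rough idea of what such a harness might look like (entirely a sketch; the generate/judge functions and prompts are hypothetical, not anything OpenAI has described):

```python
# Sketch of a best-of-N + self-judging harness (hypothetical interfaces).
# generate() and judge() stand in for calls to a reasoning model; neither
# is a real API from any provider.

def generate(problem: str, seed: int) -> str:
    """Produce one candidate solution (placeholder for a model call)."""
    return f"candidate solution {seed} for: {problem}"

def judge(problem: str, candidate: str) -> float:
    """Self-grade a candidate in [0, 1] (a real harness would ask a model)."""
    return 0.5

def best_of_n(problem: str, n: int = 32) -> str:
    """Sample N candidates and keep the one the judge scores highest."""
    candidates = [generate(problem, seed) for seed in range(n)]
    return max(candidates, key=lambda c: judge(problem, c))

print(best_of_n("Prove that ... (IMO-style problem statement)"))
```

More attempts, longer reasoning per attempt, and a judge picking the best output all cost compute that a $400 public-API run doesn't get.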
OpenAI achieved Gold with an unreleased model, GPT-5. Read the tweets; they explain what they did.
Actually, I did it a year ago but I just don't want to release my model.
OpenAI explicitly said it’s not GPT-5 but another experimental research model https://x.com/alexwei_/status/1946477756738629827?s=46