Comment by simonw
6 days ago
> How can anyone intellectually honest not see that?
The idea that they can only solve problems they've seen before in their training data is one of those things that seems obviously true, but it doesn't hold up once you consistently use them to solve new problems over time.
If you won't accept my anecdotal stories about this, consider the fact that both Gemini and OpenAI got gold medal level performance in two extremely well regarded academic competitions this year: the International Math Olympiad (IMO) and the International Collegiate Programming Contest (ICPC).
This is notable because both of those contests have brand new challenges created for them that have never been published before. They cannot be in the training data already!
> consider the fact that both Gemini and OpenAI got gold medal level performance
Yet ChatGPT 5 imagines API functions that are not there and cannot figure out basic solutions even when pointed to the original source code of libraries on GitHub.
Which is why you run it in a coding agent loop using something like Codex CLI - then it doesn't matter if it imagines a non-existent function because it will correct itself when it tries to run the code.
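To make that concrete, here is a minimal sketch of the kind of loop I mean. None of this is Codex CLI's actual code, and fake_model is a hypothetical stand-in for the LLM call: the first attempt uses a helper that doesn't exist, the harness runs it, and the traceback drives the retry.

```python
# Minimal sketch of a coding-agent loop (illustration only, not Codex CLI's
# real implementation). The model call is faked: the first attempt calls a
# non-existent function, the retry fixes it. The point is that execution
# feedback, not the model's first guess, decides when the loop stops.
import subprocess
import sys

def fake_model(task: str, feedback: str | None) -> str:
    """Hypothetical stand-in for an LLM call. A real agent would send the
    task plus the previous traceback back to the model."""
    if feedback is None:
        return "print(reverse_words('hello world'))"   # hallucinated helper
    return "print(' '.join(reversed('hello world'.split())))"  # corrected

def agent_loop(task: str, max_attempts: int = 5) -> str | None:
    feedback = None
    for _ in range(max_attempts):
        code = fake_model(task, feedback)
        # Run the candidate; a hallucinated function fails right here.
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return code            # it actually ran; accept it
        feedback = result.stderr   # feed the traceback back for the retry
    return None

print(agent_loop("reverse the words in 'hello world'"))
```

The first attempt fails with a NameError, the second runs clean, and only the version that actually executed gets returned.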
Can you expand on "cannot figure out basic solutions even when pointed to the original source code of libraries on GitHub"? I have it do that all the time and it works really well for me (at least with modern "reasoning" models like GPT-5 and Claude 4.)
As a human, I sometimes write code that does not compile first try. This does not mean that I am stupid, only that I can make mistakes. And the development process has guardrails against me making mistakes, namely, running the compiler.
Agreed
Infallibility is an unrealistic bar to measure LLMs against
Yes. I don't see why these have to be mutually exclusive.
I feel they are mutually inclusive! I don’t think you can meaningfully create new things if you must always be 100% factually correct, because you might not know what correct is for the new thing.
> If you won't accept my anecdotal stories about this, consider the fact that both Gemini and OpenAI got gold medal level performance in two extremely well regarded academic competitions this year: the International Math Olympiad (IMO) and the International Collegiate Programming Contest (ICPC).
it's not a fair comparison
the competitions for humans are a display of ingenuity and intelligence because of the limited resources available to them
meanwhile for the "AI", all it demonstrates is that if you have a dozen billion-dollar data centres and a couple of hundred gigawatt hours you can dedicate to brute-forcing a solution, then you can maybe match the level of one 18 year old, when you have a problem with a specific, well known solution
(to be fair, a smart 18 year old)
and short of Moore's law lasting another 30 years, you won't be getting this from the dogshit LLMs on shatgpt.com
Google already released the Gemini 2.5 Deep Think model they used in ICPC as part of their $250/month "Ultra" plan.
The trend with all of these models is for the price for the same capabilities to drop rapidly - GPT-3 three years ago was over 1,000x the price of much better models today.
I'm not yet ready to bet against that trend holding for a while longer.
> GPT-3 three years ago was over 1,000x the price of much better models today.
right, so only another 27 years of Moore's law left to go
> I'm not yet ready to bet against that trend holding for a while longer.
I wouldn't expect an industry evangelist to say otherwise
"they output strings that didn't exit before" is some hardcore, uncut cope