Comment by nopinsight
6 hours ago
I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.
Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.
You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:
Frontier models are still nowhere near solving it, but progress has been rapid.
* o3 (high), <1.5 years ago: 1.4%
* GPT-5.4 (xhigh): 23.4%
* GPT-5.5 (xhigh): 27.1%
* GPT-5.5 Pro (xhigh): 30.6%
> there's no reason to believe the progress of LLMs [...] will stop anytime soon
Wrong. Every advancement has followed an S-curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".
Great. You see a shape in graphs. And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop).
Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.
Which makes the patronizing sarcasm all that much more nauseating.
Nausea aside, what evidence does anyone have that “super intelligence” of the sort your argument alludes to is even possible? Because that’s what we’re really talking about: greater-than-human intelligence on this sort of academic task. For example: when LLMs start contributing meaningfully to their own development, that would be a convincing indicator, IMO.
This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently (https://www.nature.com/articles/d41586-024-03214-7).
So if, instead of text, we come up with a different representation for mathematical or physical problems, that could improve the quality of the output while reducing the amount of transformer compute needed for encoding and decoding I/O and for internal reasoning.
There are also different inference methods, like autoregression and diffusion, and maybe others we haven't discovered yet.
Combine those variables with the internal arrangement of layers, parameter count, and the actual dataset, and you have such a large search space of possible models that no one can reliably tell whether LLM performance is going to flatline or continue to improve exponentially.
>This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently.
That's precisely what happens on the flattening side of an S-curve.
There are advancements that do not follow S-curves: consider, for instance, total data transmitted over all networks, or financial derivatives volumes.
I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?
Those are measuring the utility of a technological advancement by looking at usage, not the pace of advancement of said technology.
>There are advancements that do not follow S-curves: consider, for instance, total data transmitted over all networks, or financial derivatives volumes
Or Roman trade volume before the Fall of Rome.
Not to mention that what you describe is not technological improvement but an increase in data or money flows: not the same thing.
Total volume of usage is not an advancement, it’s orthogonal.
It’s more of a guess if you don’t know about things like scaling laws and RL with verification. The onus is on the “we’re going to saturate anytime soon” claim, because every measurement points to that not being true.
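For context, the “scaling laws” usually referred to here are the Chinchilla-style fit from Hoffmann et al. (2022): loss falls as a power law in parameters and training tokens toward an irreducible floor. A minimal sketch, with the published constants treated as illustrative rather than authoritative:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta.

    Constants are the fits reported by Hoffmann et al. (2022); treat the
    exact numbers as illustrative.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Each 10x in model/data scale still improves loss, but by shrinking
# increments -- diminishing, not zero, returns.
for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}, D={20 * n:.0e}: loss={chinchilla_loss(n, 20 * n):.3f}")
```

The point the fit makes is that nothing in it predicts a hard stop, only smaller gains per order of magnitude of compute.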
He said "will stop anytime soon". He didn't say forever.
Which still makes no sense. There is the same chance that we are flatlining now as that we will flatline in, e.g., 3 or 5 years.
It can be an S-curve (and it almost surely is), but on every chart you can plot, you don't see even an inkling of the bend yet.
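To illustrate why "no visible bend yet" is weak evidence either way: before its inflection point, a logistic S-curve is numerically almost indistinguishable from a pure exponential. A sketch with arbitrary assumed parameters (not data from any benchmark):

```python
import math

def logistic(t: float, cap: float = 100.0, rate: float = 1.0, midpoint: float = 10.0) -> float:
    """Logistic (S) curve with carrying capacity `cap`, inflection at `midpoint`."""
    return cap / (1.0 + math.exp(-rate * (t - midpoint)))

def exponential(t: float, scale: float, rate: float = 1.0) -> float:
    """Pure exponential with the same growth rate."""
    return scale * math.exp(rate * t)

# Match the curves at t=0, then compare them well before the inflection point.
scale = logistic(0)
for t in range(6):
    l, e = logistic(t), exponential(t, scale)
    print(f"t={t}: logistic={l:.4f} exponential={e:.4f} ratio={l / e:.3f}")
```

Halfway to the inflection point, the two curves still agree to within about 1%, so charts alone can't tell you where (or whether) the bend comes.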
What the fuck does that have to do with “soon”?
Software and hardware have no known limits. Theoretically, we could use bosons for computation and fit the current total computation of the entire world into one cm³. Same with software: there has never been a stop to new algorithms. With LLMs there are so many parts that will get better, and none of them are very far-fetched.
This is FUD and extremely wrong. None of these advancements have followed an S-curve. This time IS different, and it should be obvious to you at this point.
Deep Think still makes many, many more mistakes than GPT-5.5 Pro on math.
There are many indications that model progress is slowing down, so that is not entirely accurate.
Please be specific, because outside of anecdotal blog posts by people who don’t know what they’re talking about, it’s not true. Look at scaling laws and composite benchmarks like the Epoch Capability Index: nothing at all suggests “model progress is slowing down”.
Which indications are those?
The cost factors on the new models compared to the old models.
Investment dollars.
Nobody is releasing NEW models