
Comment by DanMcInerney

6 days ago

I'm really hoping GPT-5 is a larger jump in metrics than the last several releases we've seen, like Claude 3.5 to Claude 4 or o3-mini-high to o3-pro. That said, I've been building agents for about a year now, and despite the benchmarks showing only slight improvement, I have seen that each new generation feels noticeably better at exactly the same tasks I gave the previous generation.

It would be interesting if there were a model specifically trained on task-oriented data. It's my understanding they're trained on all available data, but I wonder if one could be fine-tuned, or trained with some kind of reinforcement learning, on breaking general tasks down into specific implementations. Essentially an agent-specific model.

I'm seeing big advances that aren't shown in the benchmarks: I can simply build software now that I couldn't build before. The level of complexity that I can manage and deliver is higher.
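
A rough illustration of the "agent-specific model" idea above: supervised fine-tuning data that pairs a general task with an explicit decomposition into concrete steps. This is only a sketch under that assumption; the field names, example task, and JSONL format are invented for illustration, not any vendor's actual training format.

    import json

    # Hypothetical fine-tuning records pairing a general task with an explicit
    # decomposition into concrete sub-steps. Field names and the task are
    # invented for illustration; no real training format is implied.
    examples = [
        {
            "task": "Add rate limiting to the public API",
            "plan": [
                "Identify the endpoints that need rate limiting",
                "Choose a token-bucket limit per API key",
                "Implement middleware that checks and decrements the bucket",
                "Return HTTP 429 with a Retry-After header when exhausted",
                "Add integration tests for the limit and the error response",
            ],
        },
    ]

    # Serialize as JSONL, a common format for fine-tuning datasets.
    with open("task_decomposition.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")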

  • A really important thing is the distinction between performance and utility.

    Performance can improve linearly and utility can be massively jumpy. For some people/tasks performance can have improved but it'll have been "interesting but pointless" until it hits some threshold and then suddenly you can do things with it.

  • Yeah I kind of feel like I'm not moving as fast as I did, because the complexity and features grow - constant scope creep due to moving faster.

  • I am finding that my ability to use it to code aligns almost perfectly with increases in token memory.

  • yeah, the benchmarks are just a proxy. o3 was a step change where I started to really be able to build stuff I couldn't before

  • Mind giving some examples?

    • Not OP, but a couple of days ago I managed to vibecode my way through a small app that pulled data from a few services and ran a few validation checks. By itself it's not very impressive, but my input was literally "this is what the responses from endpoints A, B and C look like. This field included somewhere in A must appear somewhere in the response from B, and the response from C must feature this and that from responses A and B. If the responses include links, check that they exist". To my surprise, it generated everything in one go. No retries or Agent-mode churn needed. In the not-so-distant past this would have required progressing through smaller steps, and I'd have had to fill in tests to nudge Agent mode not to mess up. Not today. (A rough sketch of the kind of checks I mean is below this thread.)


  • Okay, but this has everything to do with the tooling and nothing to do with the models.

    • I mostly disagree with this.

      I have been using 'aider' as my go-to coding tool for over a year. It basically works the same way that it always has: you specify all the context and give it a request, and that goes to the model without much massaging.

      I can see a massive improvement in results with each new model that arrives. I can do so much more with Gemini 2.5 or Claude 4 than I could do with earlier models and the tool has not really changed at all.

      I will agree that for the casual user, the tools make a big difference. But if you took the tool of today and paired it with a model from last year, it would go in circles.
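
As a rough illustration of the cross-endpoint checks described in the vibecoding comment above: fetch the three responses, assert that the shared fields line up, and verify that any embedded links resolve. The endpoint URLs, field names, and rules below are invented placeholders; this is a sketch of the idea, not the actual generated app.

    import requests

    # Placeholder endpoints; the real services in the comment above are unnamed.
    ENDPOINT_A = "https://example.com/api/a"
    ENDPOINT_B = "https://example.com/api/b"
    ENDPOINT_C = "https://example.com/api/c"

    def fetch(url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.json()

    def link_exists(url):
        # A HEAD request is enough to check that a link resolves.
        return requests.head(url, timeout=10, allow_redirects=True).ok

    def validate(a, b, c):
        errors = []
        # Hypothetical rule 1: the id from A must appear somewhere in B's payload.
        if str(a.get("record_id")) not in str(b):
            errors.append("record_id from A not found anywhere in response B")
        # Hypothetical rule 2: C must echo fields from both A and B.
        if c.get("source_id") != a.get("record_id"):
            errors.append("C.source_id does not match A.record_id")
        if c.get("status") != b.get("status"):
            errors.append("C.status does not match B.status")
        # Rule 3: any links included in the responses must resolve.
        for payload in (a, b, c):
            for url in payload.get("links", []):
                if not link_exists(url):
                    errors.append("dead link: " + url)
        return errors

    if __name__ == "__main__":
        a, b, c = fetch(ENDPOINT_A), fetch(ENDPOINT_B), fetch(ENDPOINT_C)
        for err in validate(a, b, c):
            print("FAIL:", err)

The point is only that each of the commenter's natural-language rules maps to a couple of lines of straightforward checking code.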

That would require AIME 2024 going above 100%.

There were always going to be diminishing returns on these benchmarks. It's by construction; it's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.

Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.

If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.
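
A toy numerical illustration of the "by construction" point: suppose, purely as an assumption for this sketch, that every release cuts the underlying error rate by a constant 10x. The benchmark score still flattens, because it is capped at 100%.

    # Illustrative assumption: each release cuts the error rate by a constant 10x.
    error = 0.10  # start at 90% on some fixed, capped benchmark
    for release in range(1, 5):
        new_error = error / 10
        gain = (1 - new_error) - (1 - error)  # benchmark points gained this release
        print(f"release {release}: {1 - error:.4%} -> {1 - new_error:.4%} "
              f"(+{gain:.4%}, same 10x error reduction each time)")
        error = new_error

Identical multiplicative progress shows up as +9 points, then +0.9, then +0.09: diminishing returns on the benchmark without any slowdown in the underlying improvement.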

  • There is still plenty of room for growth on the ARC-AGI benchmarks. ARC-AGI-2 is still <5% for o3-pro, and ARC-AGI-1 is only at 59% for o3-pro at high reasoning effort:

    "ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task

    ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task

    Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"

    - https://x.com/arcprize/status/1932535378080395332

    • I’m not sure the ARC-AGI benchmarks are interesting; for one, they are image-based, and for two, most people I show them to have trouble understanding them, and in fact I had trouble understanding them myself.

      Given that the models don’t even see the versions we get to see, it doesn’t surprise me that they have issues with these. It’s not hard to make benchmarks so hard that neither humans nor LLMs can do them.


I remember the saying that from 90% to 99% is a 10x increase in accuracy, but 99% to 99.999% is a 1000x increase in accuracy.

Even though it's a large ~9-percentage-point increase in accuracy first, and then only a 0.999-point increase.
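
To make that concrete, here is a tiny calculation (a sketch, using just the accuracy figures from this thread) of the error-rate framing: the factor by which errors shrink between two accuracy levels, and the equivalent "1 error per N" view.

    def error_reduction(acc_from, acc_to):
        """Factor by which the error rate shrinks going from acc_from to acc_to."""
        return (1 - acc_from) / (1 - acc_to)

    for a, b in [(0.90, 0.99), (0.99, 0.99999)]:
        factor = error_reduction(a, b)
        print(f"{a * 100:g}% -> {b * 100:g}%: errors shrink {factor:,.0f}x "
              f"(1 error per {1 / (1 - a):,.0f} -> 1 per {1 / (1 - b):,.0f})")

Each step is the same kind of multiplicative win in errors, even though the headline accuracy number barely moves.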

  • Sometimes it’s nice to frame it the other way, eg:

    90% -> 1 error per 10

    99% -> 1 error per 100

    99.99% -> 1 error per 10,000

    That can help to see the growth in accuracy, when the numbers start getting small (and why clocks are framed as 1 second lost per…).

    • Still, for the human mind it doesn't make intuitive sense.

      I guess it's the same problem with the mind not intuitively grasping the concept of exponential growth and how fast it grows.


  • I think the proper way to compare probabilities/proportions is by odds ratios: 99:1 vs 99999:1 (so a little more than 1000x). This also lets you talk about “doubling likelihood”, where twice as likely as 1/2 = 1:1 is 2:1 = 2/3, and twice as likely again is 4:1 = 4/5. (A quick sketch of this is below, after this thread.)

  • The saying goes:

    From 90% to 99% is a 10x reduction in error rate, but 99% to 99.999% is a 1000x decrease in error rates.

  • What's the required computation power for those extra 9s? Is it linear, poly, or exponential?

    Imo we got to the current state by harnessing GPUs for a 10-20x boost over CPUs. Well, and cloud parallelization, which is ?100x?

    ASIC is probably another 10x.

    But the training data may need to vastly expand, and that data isn't going to 10x. It's probably going to degrade.
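
Picking up the odds-ratio comment earlier in this thread, a minimal sketch of that arithmetic: converting a probability to odds, comparing two accuracies by the ratio of their odds, and "doubling the likelihood" by doubling the odds.

    def odds(p):
        """Odds in favor of an event with probability p, e.g. 0.99 -> 99.0 (99:1)."""
        return p / (1 - p)

    def double_likelihood(p):
        """Probability whose odds are twice the odds of p."""
        o = 2 * odds(p)
        return o / (1 + o)

    # 99% vs 99.999%: odds of 99:1 vs 99999:1, a little more than a 1000x ratio.
    print(odds(0.99999) / odds(0.99))  # ~1010.1

    # Doubling likelihood: 1/2 -> 2/3 -> 4/5 as the odds go 1:1 -> 2:1 -> 4:1.
    p = 0.5
    for _ in range(2):
        p = double_likelihood(p)
        print(p)  # ~0.667, then ~0.8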

> I'm really hoping GPT-5 is a larger jump in metrics than the last several releases we've seen, like Claude 3.5 to Claude 4 or o3-mini-high to o3-pro.

This kind of expectation explains why there hasn't been a GPT-5 so far, and why we get a dumb numbering scheme instead for no good reason.

At least Claude eventually decided not to care anymore and released Claude 4, even if the jump from 3.7 isn't particularly spectacular. We're well into diminishing returns at this point, so it doesn't really make sense to postpone the major version bump; it's not like they're going to make a big leap again anytime soon.

  • I have tried Claude 4.0 for agentic programming tasks, and it really outperforms Claude 3.7 by quite a bit. I don't follow the benchmarks - I find them a bit pointless - but anecdotally, Claude 4.0 can help me in a lot of situations where 3.7 would just flounder, completely misunderstand the problem and eventually waste more of my time than it saves.

    Besides, I do think that Google Gemini 2.0 and its massively increased token memory was another "big leap". And that was released earlier this year, so I see no sign of development slowing down yet.

  • > We're well into the diminishing returns at this point

    Scaling laws, by definition, have always had diminishing returns, because it's a power-law relationship with compute/params/data, but I am assuming you mean diminishing beyond what the scaling laws predict.

    Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.

    Because of that power-law relationship, it takes adding a lot of compute/params/data to see a big jump; the rule of thumb is that you have to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the trend of using major version numbers to denote when they more than 10x the training scale of the previous model.

    * GPT-1 was 117M parameters.

    * GPT-2 was 1.5B params (~10x).

    * GPT-3 was 175B params (~100x GPT-2 and exactly 10x Turing-NLG, the biggest previous model).

    After that it becomes blurrier, as the field switched to MoEs (and stopped publishing); scaling laws for parameters apply to monolithic models, not really to MoEs.

    But looking at compute, we know GPT-3 was trained on ~10k V100s, while GPT-4 was trained on a ~25k A100 cluster. I don't know about training time, but we are looking at close to 10x the compute.

    So to train a GPT-5-like model, we would expect ~250k A100s, or ~150k B200 chips, assuming the same training time. No one has a cluster of that size yet, but all the big players are currently building one.

    So OpenAI might just be reserving the GPT-5 name for this 10x-GPT-4 model. (A back-of-the-envelope sketch of this scaling arithmetic is below, after this thread.)

    • > but I am assuming you mean diminishing beyond what the scaling laws predict.

      You're assuming wrong; in fact, focusing on scaling laws underestimates the rate of progress, as there is also a steady stream of algorithmic improvements.

      But still, even though hardware and software progress, we are facing diminishing returns and that means that there's no reason to believe that we will see another leap as big as GPT-3.5 to GPT-4 in a single release. At least until we stumble upon radically new algorithms that reset the game.

      I don't think it makes any economic sense to wait until you have your “10x model” when you can release 2 or 3 incremental models in the meantime, at which point your “10x” becomes an incremental improvement in itself.
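
A back-of-the-envelope sketch of the "10x per major version" reasoning a couple of comments up, using only the parameter counts and cluster sizes quoted there plus one assumed per-chip speedup factor. Nothing here is a measurement; it just shows the orders of magnitude.

    # Parameter counts quoted in the thread above.
    params = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}
    print(params["GPT-2"] / params["GPT-1"])  # ~13x
    print(params["GPT-3"] / params["GPT-2"])  # ~117x

    # Rough compute comparison, using the cluster sizes from the comment and an
    # ASSUMED ~3x per-chip speedup from V100 to A100 (ballpark, not a measurement).
    gpt3_chips, gpt4_chips = 10_000, 25_000
    v100_to_a100 = 3.0
    print(gpt4_chips * v100_to_a100 / gpt3_chips)  # ~7.5x, i.e. "close to 10x"

    # Extrapolating the same 10x rule of thumb to a hypothetical GPT-5-scale run:
    print(gpt4_chips * 10)  # ~250k A100-equivalents, as the comment suggests

Whether that extrapolation holds depends on training time, MoE sparsity, and algorithmic gains, which is exactly the blurriness the comment mentions.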

There's a new set of metrics that captures advances better than MMLU or its pro version, but nothing is yet as standardized, and in particular very few keep a hidden test set to prevent the gains from being the result of benchmark-directed fine-tuning.

It's hard to be 100% certain, but I am 90% certain that the benchmarks leveling off at this point should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).

  • > (...) at this point should tell us that we are really quite dumb and simply not very good at either using or evaluating the technology (yet?).

    I don't know about that. I think it's mainly because nowadays LLMs can produce very inconsistent results. In some applications they can generate surprisingly good code, but during the same session they can also make missteps and shit the bed while following a prompt asking for small changes. For example, sometimes I still get responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new method" and, in response, the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?
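
One pragmatic, partial answer to that last question is to treat every LLM edit as untrusted and gate it mechanically: diff the change, flag suspiciously large deletions, and run the test suite before accepting. A minimal sketch, assuming a git repo and a pytest-style test command; the threshold is arbitrary.

    import subprocess
    import sys

    # Threshold is arbitrary; tune for your project.
    MAX_DELETED_LINES = 50

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True)

    def deleted_lines():
        """Count lines removed in the working tree relative to HEAD."""
        diff = run(["git", "diff", "--numstat"]).stdout
        total = 0
        for line in diff.splitlines():
            added, deleted, _path = line.split("\t", 2)
            if deleted != "-":  # "-" marks a binary file in numstat output
                total += int(deleted)
        return total

    def gate_llm_edit(test_cmd=("pytest", "-q")):
        deleted = deleted_lines()
        if deleted > MAX_DELETED_LINES:
            print(f"refusing: {deleted} lines deleted, review by hand")
            return False
        if run(list(test_cmd)).returncode != 0:
            print("refusing: test suite failed after the edit")
            return False
        print("edit passes the basic checks")
        return True

    if __name__ == "__main__":
        sys.exit(0 if gate_llm_edit() else 1)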