Comment by NiloCK

1 day ago

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 is that I can't firmly point to any capability improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it so that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...

You just can't train a 1T+ parameter model that fast. It is a giant "if" how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...

  • Took me a while to find what you were referring to by GRAM. It's an arXiv paper from 9 days ago that's not properly indexed by search engines.

    (G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.

    https://arxiv.org/html/2605.19376v1

  • > Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T param

    I agree, but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.

    Game theory-wise they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.

    They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.

    • > While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret.

      So you are saying that frontier AI labs are spending billions of dollars on datacenters as a form of marketing. And they are colluding to hide the fact that they don't need to.

      Of course they profit more if they are in front, but bleeding money to pretend to be in front is not a winning strategy. They can't fool the market if their models are not actually better, and they know this.

    • Given that tokens are supply constrained right now for Anthropic and OpenAI (especially a problem for Anthropic), stepwise efficiency advances for either would give it a leg up on the other. It would also help them better compete on price with Chinese models.

      Given that neither company releases parameter counts, that sort of information would be slow coming out anyway. The most important thing is improvements in actual performance/benchmark numbers, which allow them to preserve their price points as much as possible.

  • >It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

    I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

    If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

    I'm curious if someone here with a stronger background in the space has a similar intuition or not.

    • Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.

      There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.

      But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter new niches, gain a footing in them and begin to dominate them. At times, expensive, highly desirable niches.

      I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.

    • It’s really worth distinguishing between old-fashioned student-teacher distillation (i.e. at the level of layers, weights and distributions) and large-scale synthetic dataset creation.

      The latter is much better (since you can clean up, review, update responses and filter your datasets).

      I suspect nobody is doing real student-teacher distillation; it’s just easier to do a bunch of training on the same giant corpus and then post-train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM).
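
      For anyone unfamiliar with the first kind, here is a minimal sketch of the classic soft-label (logit-matching) distillation loss: the student is trained to match the teacher's temperature-softened output distribution. The function names, logits and temperature are made-up illustration values, not anything a lab has confirmed using.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher distribution (target)
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical next-token logits over a 3-token vocabulary.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
print(distillation_loss(teacher, student))  # positive: student disagrees with teacher
print(distillation_loss(teacher, teacher))  # zero: distributions match
```

      The synthetic-dataset route needs none of this: the big model only writes text, and the small model is trained on that text with the ordinary next-token loss.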

    • > I don't disagree, but how much of this ends up being distillation?

      A lot, so you can bet tens of millions are flowing to Congress to have distillation declared illegal before this happens. And then it'll happen anyway.

    • > I don't disagree, but how much of this ends up being distillation?

      You don't need distillation. They already have the training sets.

      It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

  • I looked into this "GRAM" stuff a sibling comment links further to, and just to say:

    - this gets reinvented/rediscovered constantly under different names

    - it can't be trained very well (right now, will change)

    - massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)

    - BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

    I follow this stuff closely, I think I know what I'm talking about (edited for formatting)
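
    A back-of-envelope version of the bandwidth arithmetic in the third bullet, as a rough sketch. The vocabulary size (2^17), residual width (4096) and 16 bits per dimension are assumed illustration values, and dim × bits is only a raw upper bound on what a vector can carry, not a measured figure.

```python
import math

def token_bits(vocab_size: int) -> float:
    """Upper bound on the bits carried by one sampled token."""
    return math.log2(vocab_size)

def residual_bits(dim: int, bits_per_dim: int) -> int:
    """Raw storage of one residual-stream vector (an upper bound, not usable information)."""
    return dim * bits_per_dim

VOCAB_SIZE = 2 ** 17   # assumed, chosen so log2 gives the 17 from the bullet
RESIDUAL_DIM = 4096    # assumed "thousands of dimensions"
BITS_PER_DIM = 16      # assumed 16-bit activations

ratio = residual_bits(RESIDUAL_DIM, BITS_PER_DIM) / token_bits(VOCAB_SIZE)
orders_of_magnitude = math.log10(ratio)
print(f"{token_bits(VOCAB_SIZE):.0f} bits per token, ratio {ratio:.0f}x, {orders_of_magnitude:.1f} OoM")
```

    With these assumed numbers the ratio lands between 3 and 4 orders of magnitude; a narrower stream or fewer effective bits per dimension pulls it down toward 3.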

    • > - this gets reinvented/rediscovered constantly under different names

      What are the different names? I haven't seen this before.

      > - it can't be trained very well (right now, will change)

      If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?

      > the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

      Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

  • I second this idea: LLMs will plateau. They are already pretty good. Plus, scientists struggle to actually score their performance accurately (esp. when it comes to reasoning).

    With that said, they are now hitting the walls of energy costs and memory shortages. Your brain uses 20W -- don't take it as an insult. There are orders of magnitude to gain from producing energy-efficient models (or model runners).

    So I am expecting same performance at lower costs for the coming years.

  • There are endless returns to frontier intelligence. Just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

    Most software engineers will just need cheap tokens.

    But things like physics and drug discovery have no foreseeable upper bound.

    • Or governance of large organizations... There are a huge number of factors to consider, counterfactuals, studies, lots of non-obvious second and third order effects, etc. We're barely able to get basic governance without creating huge problems (low density zoning rubber stamped across the nation creating a housing crisis, for example), so the bar isn't high.

      We pay CEOs an enormous amount because a small improvement they bring to an org's performance can make a massive difference in organizational value.

    • The upper bound is limited by market size and cost of intelligence.

      Throwing more intelligence at a problem doesn’t necessarily pan out financially; otherwise we wouldn’t have a single underemployed biology PhD.

  • Absolutely. That’s why they’re rushing to IPO now, to squeeze the last drop out of the bubble: they know this is a dead end.

    • I think we could run for at least a decade further with no model changes/improvements, just better harnesses and infra around this agentic way of developing.

    • It's unclear it's a dead-end within 5 years.

      There are still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.

      Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

      Some people would pay $200 a month forever not to have to open the terminal one time...

  • GRAM is another one of those "stupid specific architectures" - same as HRMs, etc. It can sort of contest LLMs at specific puzzles. It demonstrated that much. It's not a general contender with LLMs at LLM tasks.

    If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.

    But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".

  • Small models don't have enough parameters to memorize the entire internet. For very common prompts you don't notice that, but when you rely on some niche knowledge that might only appear once in the entire web, a single blogpost, a single github issue, a single pdf, you need to be lucky enough that the agent runs a web search AND it returns what you need.

    Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.

  • By pointing out the exact things that will likely happen you are oddly enough hedging against (at least some of them) happening!

    A) I reckon it's true that smaller models will continue to improve massively through optimization and better and better harnesses, this tech is all still very young and A LOT of resources and (good-)will is being thrown at it.

    B) The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.

    C) More of an observation that I think is worth keeping in mind clearly; Karl Popper's black swan and all, truth in our temporal world IS a gradient!

    • > The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.

      There's less room to improve in things on several fronts.

      GRAM may very well scale sub-linearly with parameter growth. A 100M param model may gain reasoning by a factor of 4000, while a 100B model gains reasoning by a factor of 2, and a 1T model actually gets worse.

      Additionally, the 1T model with reasoning is already pretty good. It can only improve in certain things so much.

      If you score 0.02% on a metric (which small models often do), you can pretty easily get 4000x better. If you're already scoring >50%, you can't even get 2x better.
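
      The ceiling argument as arithmetic, in a minimal sketch: for a metric capped at 100%, the largest possible multiplicative gain is 1 / current score. `max_gain` is a hypothetical helper name; the 0.02% and 50% scores are the ones from the paragraph above, and 60% is added as an example of ">50%".

```python
def max_gain(score: float) -> float:
    """Largest multiplicative improvement possible on a metric capped at 1.0 (100%)."""
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in (0, 1]")
    return 1.0 / score

for score in (0.0002, 0.5, 0.6):  # 0.02%, 50%, 60%
    print(f"score {score:.2%}: at most {max_gain(score):.2f}x better")
```

      At 0.02% the ceiling is 5000x, so a 4000x gain fits under it; anything above 50% has a ceiling below 2x.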

  • I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.

  • "It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"

    What insight do you have to make this claim?

    • Have you personally used any of the latest batch of even smaller local models? They certainly don't beat SotA models at coding... but with a good harness they are able to achieve things that I couldn't achieve with SotA models last year.

      I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).

      Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).

      That's just looking at "what can small models do today". When you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any Anthropic offerings.

      So it's not terribly hard to imagine a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).

    • 1. Context is all you need... They are heavily investing in getting better context (especially for coding tasks). This will disproportionately advantage smaller models (and benefit everyone).

      A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.

      2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).

  • > It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks

    The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.

    I had a bunch of images that got masked by some logic, and I had to evaluate something on the original images. Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.

    I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.

    Its coding was fine, but the solution was not the right one.

  • It is fascinating to me to see a new product category that improves so vastly year-after-year, where people commonly state that this is now the peak already.

    I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze.

    This is like going from dialup internet to DSL and acting like it has peaked before gigabit cable and fiber come along. We are at the beginning of hardware truly made for AI.

    • > I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze

      The difference in progress in smaller models is far more impressive.

      Compare Gemini 3.5 Flash to a ~16B parameter model from 24 months ago.

      Compare GPT-5.5 to a frontier model 24 months ago.

      Yes, GPT-5.5 got better. At orders of magnitude smaller parameter sizes (when factoring in ACTIVE parameters) the increase is far more pronounced.

  • The GRAM model is so much into my research direction, I love it. Thank you for posting it.

    Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.

    • > Where do I find papers like this?

      I got it from my Google News recs on my phone, because I've been watching a bunch of videos on YouTube about LeCun's ideas on World Models and JEPA (I think).

    • GRAM is a lot like the Multiple Drafts Model of Consciousness that Daniel Dennett proposed. I think researchers should read more philosophy models and bring good ideas into LLM research.

  • > It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years

    Given how well Qwen3.6-27B performs for such a small model I think you could be right. I suspect that Google, OpenAI, and Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities - like coding. Smaller, more targeted models would take up a lot fewer resources.

    • The problem is that once you reach a certain level in coding (not particularly high imo, although some would differ) the most significant improvement in your output comes from understanding requirements better and finding ways to meet requirements in productively lazy ways, bypassing busywork that seems necessary but isn't. And that's the kind of stuff you will only find from a generally intelligent model, not a code monkey that's optimized for turning requirement sheets into source code.

  • > I won't be surprised if the next gen frontier models are the last.

    the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

    • What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

      Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.

      And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.

      I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobody will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.

      Also the more I play with Pi, the more I think LLMs are already held back not by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.

  • As far as it has been studied, the relationship between model size and capability is inversely logarithmic: 10x increase in params less than doubles capability.
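
    One way to read that claim, as a sketch: assume capability follows a power law in parameter count, capability ∝ N^alpha. Both the power-law form and treating "capability" as a single number are assumptions here, not something taken from a cited study. "10x params less than doubles capability" then just means alpha < log10(2), roughly 0.30.

```python
import math

def implied_exponent(param_ratio: float, capability_ratio: float) -> float:
    """Exponent alpha such that param_ratio ** alpha == capability_ratio."""
    return math.log(capability_ratio) / math.log(param_ratio)

def capability_multiplier(param_ratio: float, alpha: float) -> float:
    """Capability gain from scaling parameters by param_ratio under the assumed power law."""
    return param_ratio ** alpha

# Exponent at which 10x the parameters would exactly double capability.
threshold = implied_exponent(10, 2)
print(f"alpha must stay below {threshold:.3f}")

# Hypothetical exponent below the threshold, for illustration only.
print(f"alpha=0.25: 10x params -> {capability_multiplier(10, 0.25):.2f}x capability")
```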

  • There are endless returns to frontier intelligence. Just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

    Most software engineers will just need cheap tokens.

    But things like physics and drug discovery have no foreseeable upper bound.

    • Within software engineering, security, reliability, and scale also seem boundless.

      Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.

      Current models are still very far from the reasoning muscle required to build things that never break, scale to billions of users with no issues, and cannot be exploited.

  • I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models, and I get 10% or even 20% improvements on models like haiku or local models.

    There's a lot of room for improving the smaller models at many levels of the stack.

    • This is a good point. It didn't really work on older small models but the latest crop are quite good at following instructions and paying attention to detail, they just lack a lot of the sophistication and nuance that the frontier models have these days. So they are often capable of doing very complex tasks, they just need more detailed and foolproof instructions than the larger models would.

  • surely training also gets cheaper so justifying it becomes easier?

    i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

    It seems like the best small models today are all distilled from bigger models

    Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

  • I'm frankly surprised the focus is still on these enormous "know everything in the world" models. I would think you could create an incredibly lean and smart "just React and React Native" model.

    • > I would think you could create an incredibly lean and smart "just React and React Native" model.

      You can, but it's not as useful as you might think.

      It needs to at least understand 1 human language to understand your intent to implement features.

      If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.

      But most people also want it to understand human language to implement features as well.

      Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...

      And for that you need A LOT more parameters than you might expect.

      You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.

      You might be thinking: why does it need to memorize dependencies? Can't it just stick all of them in its context and use its super smart brain? No, context is king. You want to keep it as short as possible. The solution is not having a smart model and putting 10M lines of context in it. The solution is having a model with enough parameters to know what it needs to know. Researchers are already working on having "packs" of knowledge where you could download a 20M param pack just for some common dependencies in JavaScript (as an example) - but AFAIK this is likely years away (and may not prove effective).

      You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as well as a ~300B model if you give it really good context vs. flooding it with mostly garbage it doesn't need from across your repository.

      If you feed it 100x more context to make up for its limited memorized general knowledge, it's going to perform thousands of times worse, completely eliminating any advantage it might get from GRAM...

    • The syntax is the easier part - most programming tasks require the reasoning and understanding of a large world model to solve problems.

      Fine tuning a 'lean and smart' model works really well for discrete, repeatable high volume tasks like support ticket triage, lead classification, content filtering, labelling, generating content with a voice, etc.

      Inefficient token burn by throwing large models at everything is definitely a problem - it's like hiring PhDs to answer the phone or to wash dishes.

  • Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.

    Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.

  • And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.

    • Even if quantum computing had any clear implications for LLMs (it doesn't), there is no such thing as a "consumer quantum computer" and there won't be in our lifetimes.

    • I'm assuming this is a joke, but:

      - why would a quantum computer help with running an LLM?

      - of course there'd be need for frontier companies - nobody else has the resources to train frontier models.

  • you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have a yearly release, but I believe the releases will continue, as long as scaling laws are intact and there are huge problems that still need solving. (think cancer)

    • >you just need to look at Mythos to see the jump in performance from a 10T(?) model

      Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.

    • You forget that these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given, so unless everything you want to create with AI is a synthesis of prior art, you're back to relying on the stone-age human brain that created AI in the first place.

  • > I won't be surprised if the next gen frontier models are the last.

    I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.

    The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.

    • The way this will play out, most likely, is that smaller models will continue to get released, and anyone willing to drop 1-3k on a home upgrade/new LLM box (no, that isn’t cheap, but it also isn’t outrageously expensive), along with improved open source agents or whatever (lot of meat on that bone), will sneak up behind the big players and start making dents. Smaller companies will pop up providing 50 users unlimited whatever for a lower cost than the big companies.

      The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.

      I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.

  • I don't think this is true at all. It might feel like this because we are used to a very, very fast release cycle, but we have only been at this for a few years.

    We have so many ways of optimizing:

    - continuously creating more and better training data

    - increasing parameters to 20/50/100TB

    - We are still waiting for Mythos access

    - We are still waiting for Mythos distillation (I haven't heard any rumors that there is a distilled version of Mythos out)

    - Reinforcement learning and evolutionary algorithms have only started to appear

    - If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones

    - We have not yet seen specialized models at all, like a German-language Java coding expert model. Why? Even with MoE architecture, you still need to have these layers around

    - Research for Diffusion and other models is still in progress

    - Nvidia just announced/showed a 7x speedup on inference for Nemotron

    - Multitoken prediction became available just a few weeks ago

    - Compute is only now getting into a range where they can do a lot more, and cheaper, experiments (see Google IO 2026 announcement)

    - World models are showing great progress and we do not know yet what they will bring to the table

    - They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts on coding and agentic work. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched, I would say, because they don't have the capacity

    - We see more and more multimodal models (these also consume compute)

    - The N-Gram paper and co.: I have not seen all of these things in Chinese open models

    - We don't even know yet what Meta is doing, but we do know they restarted their efforts again

    - Anthropic's models got a lot better, benchmark-wise, at denying nonsense asks. They do learn how to get rid of or reduce hallucinations

    - We are in the middle of the biggest Reinforcement loop whith all the training data we give them day to day and its not clear at all if they already use these models in thir training and at what stage.

    - We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this

    - There will also be continuous progress in harnesses, which on its own is not part of LLM progress (fair), but these harnesses do get better when you finetune a model to be optimized for a harness

    - ChatGPT's image model 2.0 got significantly better and came out just a month ago

    I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models and we do not know yet what this will mean. I could see that we might skip the normal transformer and find other relevant architectures.

    Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.

    There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.

    I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.

    • Great points! We do keep seeing gains from larger model sizes. I think that is still one of the factors contributing to jagged intelligence. When they increase up to around 100T parameters, that will truly be human complexity level, and I assume there will be no trace of jaggedness left.

      If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.

      And that will get us up to two orders of magnitude more parameters.

      It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.

    • > There was also a research paper where they showed that an LLM can compute things.

      Can you be a little more specific than that or provide a reference?

      I assume you're not indicating universality of neural networks?


  • I think the future will be enterprise clients will train their own models based on their needs and data.

  • > It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

    I am ready to bet against this. Scores on knowledge benchmarks like SimpleQA aren't increasing for small models.

    > It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

    Well for one, we know for certain there is Mythos, which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos-class models.

    • Knowledge benchmarks can't really be improved upon via distillation or RL. It requires those facts be added to the training corpus and for the model to memorize them better. Neither distillation nor RL really does that, and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.

      Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.

      If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in its context, I think some very interesting stuff would happen.


    • > Well for one, we know for certain there is Mythos which is meaningfully better.

      Do we?

      Have you used it?

      What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.


  • I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.

    • 5.5 is not a generation it is a trivial iteration...

      6 is for sure happening...

      As is Gemini 4.

      It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...


  • > a 60-90B model can outperform current SOTA

    My conspiracy theory is that Apple recognizes this.

    • That does seem to be the path Apple is following here. Have a local model that can answer most things and then have a fallback of cloud options when the request is too complex. The cleverness of this strategy has been overshadowed by the incredibly poor quality of their local models. It will be extremely interesting to see what next month holds and whether Google helped fine tune an Apple specific Gemini / Gemma model for their devices. Bonus points, of course, if they unveil the M5 Ultra Studio with half a terabyte of RAM to be a local "cloud model" (the true fantasy here of course would be Apple building something a little like openclaw where from your phone you could give commands to your Home Apple server). They could probably get away with charging $20k for it if it has sufficient tok/sec. If that happens and is successful one could imagine a straight line path in the next two generations to bringing the cost and form factor down to the point where some of the form factor of an Apple TV becomes everybody's home inference server / agentic HQ. Sovereign AI for everyone!

    • I think Apple might come out ahead by pure accident. Yes, Apple often waits to enter a market until it's established but in the case of AI they tried, they tried and failed. It was never the original plan to partner with OpenAI and then later with Google (Gemini). They 100% missed the boat on AI, the question now becomes: was the boat worth taking and we are still waiting to see how that plays out.

    • You need some serious memory then. Let's say around 192GB, so that not all of your memory is eaten by your LLM.
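A back-of-the-envelope way to sanity-check memory figures like this, counting model weights only. The bytes-per-parameter values are common rules of thumb and the function name is made up for illustration; KV cache, activations, and the OS all need room on top of this.

```python
# Rough weight-memory estimate. All numbers are illustrative assumptions,
# not measurements of any particular model.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


for size in (60, 90):
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(size, precision)
        print(f"{size}B @ {precision}: {gb:.0f} GB")
```

Under these assumptions a 90B model needs 180 GB for weights at 16-bit and 45 GB at 4-bit, so how much of a 192GB machine is left over depends heavily on the quantization you are willing to accept.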

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?

My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.

But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.

  • For long-running tasks, yes, 4.7 has been a noticeable improvement. It goes off the rails a lot less than 4.6 does. For shorter windows, I haven't felt as much of a difference, and I agree that the harness improvements have been the biggest lever.

    • When doing big long-running workflows, especially with plan mode, 4.7 was a clear improvement. It's considerably worse for underspecified tasks and responds to a couple of sentences with 10+ paragraphs in explanatory-type discussions.


  • Yes. You and some random indigenous guy in the Amazon likely share the same intelligence but you are more capable because you have access to writing/reading, computer, car etc. Intelligence is more than raw intelligence. It's harness, skills, tools, memory etc. If you improve all the latter but keep the raw intelligence (LLM) fixed, you certainly get better results. Same with us humans.

    • Of course, I’m not trying to dismiss gains from harness, actually the opposite.

      But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.

      If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).

      So differentiating between the two (what I’m trying to do here) is really consequential!

    • Except LLMs are simulacra of actual intelligence. Frequently in a single conversation working on a single narrowly scoped task, I am both surprised by a few insights and cursing at how it can miss obvious issues. The "raw intelligence" of LLMs leaves much to be desired.

  • To me 4.5 was mind-blowing, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc., but not really in terms of my perception of its smartness.

  • In my experience, 4.7 was a noticeable step down from 4.6.

    I was one of those people for whom Claude would never finish anything and would just randomly say, this is a good stopping point, I think you should go to bed.

    And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."

    Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...

    Codex is also way faster.

  • They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.

  • I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.

    There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.

  • I'm actually currently studying this :)

    Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks don't translate very well into the real world.

    4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.

    So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.

  • They just showed the benchmarks it improved on, but it regressed on so much more, such as the MRCR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."

  • Same. 4.7 felt like a definite regression

    • 4.7 was just them starting down the path of getting prices in line with the actual cost

      Make it dumber. Charge more (by changing the tokenizer). Call it the latest and greatest. Reset expectations.

  • Same. 4.7 has done some incredibly stupid things.

    • I think this is more a consequence of the introduction of adaptive thinking and the removal of extended thinking than of 4.7 specifically

  • Yep, until 1st June 4.6 is still x1 on Copilot, but will jump up quite a bit in cost - 4.7 was already highly priced, and the output was frankly terrible.

    It still seems trying to build general models is mostly cost prohibitive - the frontier model providers and resellers are repricing in such a way that the return on investment is dropping as developers and users become more cautious of burning their limits.

    I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.

I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.

They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.

Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.

I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficulty of the overall problem. I wonder if 4.8 will improve on that.

  • Same here. Went back to 4.5 and was happy I did it. The only frustration was that I can tell the model has declined compared to the first few weeks it was released.

    I also recently moved to 4.6 since I started hitting the context limit too often with my current project.

4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.

Data at https://gertlabs.com/rankings

  • "personality issues" I was able to tell that Opus 4.7 would take instructions more literally, which I appreciated once I calibrated my phrasing to be more precise (often asking to investigate issues, pre-4.7 it'd start making code changes instead of just giving write up). But I can see contexts where handling vague prompts would've just been worse

I am using Claude Code for formal verification with Lean. In my personal experience both Opus 4.7 and now what I see from first experiments with Opus 4.8 were big improvements. I was able to delegate proofs of larger theorems that their predecessors could not handle.

I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.

I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.

It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.

“Maybe my own tastes are saturated now”

It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.

One example: try to single-shot prompt the coding of a ChatGPT-equivalent chatbot.

Sure it will spit something out, but the feature depth, UX subtleties, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.

Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.

Not that these specific examples are important, but just to point out that scaling up expectations shows the cracks.

It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.

At what point does my CS degree become totally useless is an open question.

  • > At what point does my CS degree become totally useless is an open question.

    Why are you people saying all these things.

    We'll probably see long-distance space travel long before a degree in generic problem identification and solving becomes totally useless.

pretty spot on.

In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was creative, super slow and expensive, and would sometimes forget what it was doing, but it was getting the job done.

4.1 they made it much faster, so a lot of infra improvements.

4.5 was the time it could work on longer tasks, didn't make a lot of the obvious mistakes of 4.0, and i think this was about the time Opus went mainstream and Anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost.

4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.

4.7 they just fixed the bugs they added in 4.6. Better than 4.5.

haven't fully tested 4.8 yet.

  • > "4.6 was such a bad model,"

    It's just amusing reading all these posts with different viewpoints, just in this thread there are multiple people saying 4.6 was so much better than 4.7 and that they switched back to 4.6.

  • I gave 4.6 a miss and only recently switched from 4.5 to 4.7. I found that a particularly difficult task 4.5 struggled with (getting stuck in loops and trying to convince me the problem had been solved) was quite solvable with 4.7.

My read - 4.7 was a tactical lobotomy to improve the average experience at the expense of peak performance; necessary due to compute pressure.

Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.

4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.

I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump

  • I think they overtrained on scientific papers or such as it would spout really sophisticated sounding nonsense with a ton of complicated verbs and adjectives. 4.6 was definitely better in that regard. The more I use these tools the more I think they’re not actually that revolutionary. I mean it’s still amazing what they can do but they have very clear limitations it seems.

    • it was also astonishingly lazy. Would just ask me to write test scripts. I asked it to create simple UI buttons for testing some basic functions so I could share it with a client, and it gave me curl commands instead - and then defended it by saying that the UI is wasted work

      Frustrating because if I have a tool, I expect a tool to do what I tell it to do. Tools shouldn't have any opinions on how they should be used

I've been using gpt 5.4 and 5.5 and honestly 5.4 is solving everything at the pace I need it. I'm the biggest bottleneck in terms of reviewing PRs and my own code. So having a model which can solve a complex task in 10 minutes vs 30 minutes doesn't really give me any meaningful improvement.

Also, the biggest factor is having a good planning phase. A good plan is better than even major model improvements.

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?
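A minimal sketch of what such a randomizer could look like. The model identifiers and function names here are placeholders, not real API names; wiring the chosen slot into your actual harness is left out.

```python
import random

# Hypothetical model identifiers; substitute whatever your harness accepts.
MODELS = ["model-a", "model-b", "model-c"]


def blind_assignment(models, seed=None):
    """Hide each model behind a neutral slot label in shuffled order."""
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)
    return {f"slot-{i + 1}": model for i, model in enumerate(shuffled)}


def score_guesses(assignment, guesses):
    """Fraction of slots where the guessed model matches the hidden one."""
    correct = sum(
        1 for slot, model in assignment.items() if guesses.get(slot) == model
    )
    return correct / len(assignment)
```

Generate the assignment once, store it somewhere you won't look, and route your work through `slot-1`, `slot-2`, `slot-3` for a while. Then write down which model you think each slot is and call `score_guesses`. With three models, guessing at random gets about one slot in three right on average, so you would want to beat that consistently over several rounds before trusting your impressions.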

  • I find the quality ebbs and flows even on the same model. My guess is that it has something to do with GPU availability, but I'm only guessing.

    • Unless you're systematically repeating the exact same task, the most parsimonious explanation is that you're seeing natural variation based on different tasks, random sampling of tokens, etc.


Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.

  • i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.

    • It means for 4.7 they trained a new base model with different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer. Vs finetuning an existing model, which was the case for 4.6, and probably for 4.8.


How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?

A few days? A few weeks? Longer?

However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.

Maybe my tasks are rudimentary, but the results I get with the 4.5 model are just the same as 4.7 or 4.6. It's just that the advanced models consume more tokens and are actually loss-making for my work. The incremental changes that they are making are not really that valuable. In fact I have found that even glm 5.1 is giving me something equivalent to what Opus 4.6 gives. Am I missing something that everyone else is cheering for in these small incremental model releases?

  • I wonder if it's being done to improve revenue numbers without changing an enterprise contract? Oh what's that, your token usage went up because some of your developers switched to a new model? That sounds like a you problem.

    I think there's a big push to get these companies in a state where they can be dumped on public markets.

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.

  • I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.

    Are the dividing lines around personality? Working domains? Opinionated software stuff?

    Who knows?

  • most of my coworkers feel the opposite about 4.7, and that 4.6 was, to them, significantly better, to the point that several stopped using claude code

I have seen a noticeable difference between 4.6 Medium (the default, and I skipped 4.7 because of various reported issues) and 4.8 High or whatever the default is now. It's far more likely to say it doesn't know and seems to think about things a lot more, but then it also spends a lot more time reporting on what it's thought about so it takes longer for you to process the output. In particular 4.6 would say "I've spotted something a bit off here" whereas 4.8 will say "if you do this and then this and then this under these conditions then something will go wrong here". So it seems to be closer to the claimed capabilities for Mythos than previous versions.

ChatGPT 5.5 is consistently the much better model and by a large margin.

How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.

When I ask Claude, from a blank start, to analyze project A and project B, Claude will consistently say project B is the better structured, more robust, and more defect-free one, and it justifies that. And project B was the one created by GPT 5.5... and also the one I judge to be the best.

And yes, both at deep effort settings and starting from same specs...

  • 5.5 is much better than any Anthropic model. I hate both companies with passion but the Anthropic shills here are in overdrive mode. On top of it, it's cheaper.

    Greetings to the Anthropic office good sirs btw.

I think the issue with legibility comes down to the fact that most users are using LLMs for tasks where improvements to raw reasoning abilities wouldn't help much or at all. So it's not a matter of anyone's deficiency of perception but rather a lack of any benchmark to perceive.

It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.

IME the most noticeable performance boosts are in complex multi-agent workflows.

EX. You call an orchestration agent and define an implementation plan with the help of a number of sub agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests, which get sent back to the orchestrator and then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator, which hands it off to another set of agents doing the code review and merging the features before sending it back to you.

  • i dont think theres anything particularly special about new models for that though. thats a harness improvement

    • 1mm context window is pretty big. Even if dumber, opens new avenues. For the record I don't think we ever got better than 4 and 4.1.

Well, it seems like collectively we are all struggling to perceive model progress, given that it seems like every reply to you is reporting different experiences with which of the models has subjectively performed best for them.

We're at the top of the S-curve and you're romanticizing diminishing returns with vague hints of super human capabilities and singularities.

I'm here to complain about the churn.

I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.

> (it's smarter than me?)

I genuinely hope that you're joking with that statement.

Or this is a bot.

Or an ARG.

Or Art.

Help.

  • If LLMs have taught me anything, it is that the average person is far more gullible than I could have imagined.

    • That and also.. predictable. Robotic, even. Stimulus => Reaction

      Which is a shame, because people would have the potential for greatness. But instead, for a plethora of reasons and factors (internal and external) people end up as fleshy automatons sleepwalking on rails.

      Talking _extensively_ with LLMs over the last years made me understand humans a lot better, but, in hindsight, I'm not sure if that was a good thing.

Dangerous thing to believe IMO. The models will get better, you will notice, everyone will notice. They will get better at coding and everything else. You should plan around that.

tbh, the last 2-3 version bumps, main change has been that they take longer, and cost more/have more usage restrictions. (combined with new tooling, which eats a ton of tokens)

I'm pretty sure they're releasing 4.8 because they massively shit the bed with 4.7 and people aren't using it.

I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.

Just want to say there's no question that you're smarter than any (and every) AI.

  • I appreciate the generosity, but you're gonna want to meet me first.

    • Kind of the beauty of it is that I don't have to to know I'm right. The reason I know is that you're alive so you can do the one thing it can't ever do, which is know when to stop or give up. It would turn me and everything else in the world into paperclips repeating the same research 1,000,000 times over.


> I'll never again perceive model progress

If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars

"it's smarter than me?"

You don't have to correct it dozens of times a day!? Really?

The more difficult it is for humans to consistently and accurately compare model outputs, the more opportunity there is to spread FUD (Fear, Uncertainty, Doubt). Considering the valuations of these companies and the astronomical investments being made, a sabotage campaign with bots or paid users on reddit, twitter, YouTube, or whatever socials could go a long way towards knocking market cap off the competition. Not saying that's happening, just saying it's an obvious target. Even if the goal is not nefarious, people with a perceived bad experience are 2-3x more likely to complain. So even without bad actors involved, a new model may need to be significantly better in order to break even on the old net promoter score.
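To put a rough number on that last point, here is a toy model of how visible feedback skews when unhappy users post more often. It is not how a net promoter score is computed; the function name and the example fractions are assumptions chosen only to show the effect of the 2-3x multiplier.

```python
def negative_share(unhappy_fraction, complaint_multiplier):
    """Share of visible feedback that is negative, if unhappy users
    post `complaint_multiplier` times as often as happy ones."""
    weighted_unhappy = complaint_multiplier * unhappy_fraction
    weighted_happy = 1 - unhappy_fraction
    return weighted_unhappy / (weighted_unhappy + weighted_happy)


for multiplier in (2, 3):
    for unhappy in (0.10, 0.25):
        share = negative_share(unhappy, multiplier)
        print(f"{unhappy:.0%} unhappy, {multiplier}x louder: "
              f"{share:.0%} of posts negative")
```

With a 3x multiplier, a model that only 25% of users dislike would already produce an even split of positive and negative posts, so a thread can look evenly divided while the underlying user base is mostly satisfied.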

I maintain a log of tasks, prompts, related information etc., so I can repeat past workflows verbatim, and I can qualitatively say each model beyond 4.5 has been a regression, and it would not surprise me if 4.8 continues the trend. Each iteration has failed at more tasks that were previously completed successfully. Right now it flat out refuses to answer many benign chemistry questions, or leans into shilling too hard and ignores non-industry-funded studies on certain topics. I'm transitioning to deepseek as a result. Cheaper by far and at this stage not, strictly speaking, less capable.

I'm going to assume that at some point their "targeted training and tuning" will eventually reach some sort of "max" possible simulation of next good token. At that point I think it will be interesting to see what happens and how many parameters you really need for different verticals.

why are the models the same price?

https://platform.claude.com/docs/en/about-claude/pricing

```
Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens
Claude Opus 4.8 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.7 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.6 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.5 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok
Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok
Claude Opus 4 (deprecated) | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok
Claude Sonnet 4.6 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Sonnet 4.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Sonnet 4 (deprecated) | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok
Claude Haiku 4.5 | $1 / MTok | $1.25 / MTok | $2 / MTok | $0.10 / MTok | $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) | $0.80 / MTok | $1 / MTok | $1.60 / MTok | $0.08 / MTok | $4 / MTok
```

  • Why shouldn’t they be? They are probably the same size and cost the same to run. They are not doing full training runs (eg Mythos) so don’t need to recover insane training costs.

  • Opus 4.7 and presumably 4.8 are more expensive due to a new tokenizer that translates data into more tokens per input.
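A small sketch of why identical per-token prices can still mean a higher bill. The input and output prices are the Opus 4.5 to 4.8 and Opus 4.1 rows from the pricing list quoted in this thread; the dictionary keys, the function name, and above all the `token_inflation` value are assumptions for illustration, since nobody here gives an actual figure for how many more tokens the new tokenizer produces.

```python
# USD per million tokens, taken from the pricing list quoted in the thread.
# Dictionary keys are made-up labels, not API model names.
PRICES = {
    "opus-4.5-to-4.8": {"input": 5.00, "output": 25.00},
    "opus-4.1": {"input": 15.00, "output": 75.00},
}


def request_cost(input_tokens, output_tokens, price, token_inflation=1.0):
    """Cost of one request in USD, ignoring caching.

    token_inflation is a hypothetical factor for a tokenizer that splits
    the same text into more tokens (1.0 means no change).
    """
    billed_in = input_tokens * token_inflation
    billed_out = output_tokens * token_inflation
    return (billed_in * price["input"] + billed_out * price["output"]) / 1e6


base = request_cost(100_000, 10_000, PRICES["opus-4.5-to-4.8"])
inflated = request_cost(100_000, 10_000, PRICES["opus-4.5-to-4.8"],
                        token_inflation=1.3)  # 1.3 is an arbitrary example
print(f"same text, old tokenizer: ${base:.3f}")
print(f"same text, 30% more tokens: ${inflated:.3f}")
```

The list price per token is unchanged in both calls; only the token count for the same text differs, and the bill scales with it.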

Incremental gains compound.

  • Meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.

    • muse-spark is beating all the Chinese text models on lmarena leaderboard FYI. Maybe you only care about coding models.

    • Has meta stopped producing new models? I figured they were just regrouping after all the drama they’ve had recently. Meta’s massive user base means they don’t need to be involved in the customer acquisition rat race. Once they have a model they’re happy with they can have a billion people interacting with it within a month.


I can tell from hearing Feynman recordings that he was smarter than my own university's physics professor, but both were smarter than me.

It's almost like they used up most of the benefits of scaling and the fundamental issues that people have been talking about with LLMs for years are real.

The inability to tell if a model is improving is, I think, a tell that the model has improved up to your level of programmatic (analytic, computational) capacity.

A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.

There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.

The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.

honestly sonnet 3.7 is still good enough for me, as long as whatever tool prompts and so on are well optimized enough between harness and model.

I still haven't really noticed it being better per se.

I'm not sure about this, but I read somewhere that providers intentionally degrade models gradually, by serving lower quantizations, when a new model is about to drop.

This felt particularly visible during the 4.6 period, when people said that 4.6 felt dumber. I remember someone doing some analysis that sort of showed the models were getting dumber over time.

This would have two benefits for the company: the model costs less to run while still being sold under a standard subscription, and the next model "feels" better by comparison when it is released to the public.

Again, I am not sure whether this is the case; I'm merely proposing something that feels like it is within the realm of possibility.