
Comment by hsn915

2 months ago

I can't be the only one who thinks this version is no better than the previous one, that LLMs have basically reached a plateau, and that the "features" in all the new releases are more or less just gimmicks.

I think they are just getting better at the edges: MCP/tool calls, structured output. This definitely isn't increased intelligence, but it is an increase in the value add; I'm not sure the value added matches the training costs or company valuations, though.

In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs and it seems like it would be extremely cost-prohibitive with any sort of free plan.

  • > how any of these companies remain sustainable

    They don't; they have a big bag of money they are burning through, and they're working to raise more. Anthropic is in a better position because they don't have the majority of the public using their free tier. But, AFAICT, none of the big players are profitable. Some might get there, but likely through verticals rather than just model access.

    • If your house is on fire, the fact that the villagers are throwing firewood through the windows doesn't really mean the house will stay standing longer.

    • Doesn’t this mean that, realistically, even if “the bubble never pops”, at some point the money will run dry?

      Or do these people just bet on the post-money world of AI?

      6 replies →

  • If you read any work from Ed Zitron [1], they likely cannot remain sustainable. With OpenAI failing to convert into a for-profit, Microsoft being more interested in being a multi-model provider and competing openly with OpenAI (e.g., open-sourcing Copilot vs. Windsurf, GitHub Agent with Claude as the standard vs. Codex), Google having their own SOTA models and not relying on their stake in Anthropic, tariffs complicating Stargate, the explosion in capital expenditure and compute, etc., I would not be surprised to see OpenAI and Anthropic go under in the next few years.

    1: https://www.wheresyoured.at/oai-business/

    • I see this sentiment everywhere on Hacker News. I think it’s generally the result of consuming the laziest journalism out there. But I could be wrong! Are you interested in making a long bet backing your prediction? I’m interested in taking the positive side on this.

      2 replies →

    • There's still the question of whether they will try to change the architecture before they die. Using RWKV (or something similar) would drop the costs quite a bit, but would require risky investment. On the other hand, some are already experimenting with text diffusion, so it's slowly happening.

> and that LLMs have basically reached a plateau

This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM-based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail not mentioning anything other than "X's favourite foods" with a link to a YouTube video. Come on!

That's not to mention things like AlphaEvolve, Microsoft's agentic test demo with Copilot running a browser, exploring functionality and writing Playwright tests, and all the advances in coding.
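To be clear about what "given 3 tools" means mechanically: the agent is just a loop that shows the model the conversation so far, runs whatever tool call it asks for, appends the result, and repeats until it answers. A rough sketch of that loop (the tools and `call_model` here are hypothetical placeholders, not the demo's actual code):

```python
# Minimal sketch of a tool-calling agent loop (illustrative, not the demo's code).
import json
from typing import Callable

def search_email(query: str) -> list[dict]:
    """Hypothetical tool: return matching messages as {"id", "subject", "snippet"} dicts."""
    raise NotImplementedError("wire this up to a real mailbox")

def read_email(message_id: str) -> str:
    """Hypothetical tool: return the full body of one message."""
    raise NotImplementedError

TOOLS: dict[str, Callable] = {"search_email": search_email, "read_email": read_email}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for any chat/tool-calling LLM API. Returns either
    {"tool": name, "args": {...}} or {"answer": "..."}."""
    raise NotImplementedError("plug in your provider's API here")

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:                              # model decided it has enough
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])     # execute the requested tool
        messages.append({"role": "tool", "content": json.dumps(result, default=str)})
    return "gave up"

# run_agent("find my brother's kid's name")
```

The loop itself is trivial; the impressive part is that the model keeps refining its own queries until the name falls out.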

  • And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where the agent kept insisting it had fixed the failing tests it wrote without actually fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep the failures under the rug on some of the simplest code imaginable.

    But sure, it managed to find a name buried in some emails after being told to... Search through emails. Wow. Such magic

    [1] https://news.ycombinator.com/item?id=44050152 [2] https://news.ycombinator.com/item?id=44056530

  • Is this something that the models from 4 months ago were not able to do?

    • For a fair definition of able, yes. Those models had no ability to engage in a search of emails.

      What’s special about it is that it required no handholding; that is new.

      4 replies →

  • The LLMs have reached a plateau. Successive generations will be marginally better.

    We're watching innovation move into the use and application of LLMs.

    • Innovation and better application of a relatively fixed amount of intelligence got us from wood spears to the moon.

      So even if the plateau is real (which I doubt, given the pace of new releases and things like AlphaEvolve) and we'd only expect small fundamental improvements, some "better applications" could still mean a lot of untapped potential.

      1 reply →

I have used Claude Code a ton and I agree; I haven't noticed a single difference since updating. Its summaries are, I guess, a little cleaner, but it has not surprised me at all in ability. I find I am correcting it and re-prompting it as much as I did with 3.7 on a TypeScript codebase. In fact, I was kind of shocked how badly it did in a situation where it was editing the wrong file: it never thought to check that until I forced it to delete all the code and show that nothing changed in what we were looking at.

  • I'd go so far as to say Sonnet 3.5 was better than 3.7

    At least I personally liked it better.

    • I also liked it better, but the Aider leaderboards are clear that 3.7 was better. I found it extremely over-eager as a coding agent, but my guess is that it needed different prompting than 3.6.

This is my feeling too, across the board. Nowadays, benchmark wins seem to come from tuning, which then causes losses in other areas. o3 and o4-mini also hallucinate more than o1 on SimpleQA and PersonQA. Synthetic data seems to cause higher hallucination rates, and reasoning models are at even higher risk, since a hallucination at any reasoning step can throw the model off track.

LLMs, in a general-use sense, have been done since earlier this year. OpenAI discovered this when they had to cancel GPT-5 and later released the “too costly for the gains” GPT-4.5, which will be sunset soon.

I’m not sure the stock market has factored all this in yet. There needs to be a breakthrough to get us past this place.

The benchmarks in many ways seem very similar to Claude 3.7 for most cases.

That's nowhere near enough reason to think we've hit a plateau - the pace has been super fast, give it a few more months to call that...!

I think the opposite about the features - they aren't gimmicks at all, though indeed they aren't part of the core AI. Rather, it's important "tooling" adjacent to the AI that we need in order to actually leverage it. The LLM field in popular usage is still in its infancy. If the models don't improve (though I expect they will), we have a TON of room with these features - how we interact, how we feed them information, tool calls, etc. - to greatly improve usability and capability.

It's not that it isn't better, it's actually worse. Seems like the big guys are stuck in a race to overfit for benchmarks, and this is becoming very noticeable.

It seems MUCH better at tool usage. Just had an example where I asked Sonnet 4 to split a PR I had after we had to revert an upstream commit.

I didn't want to lose the work I had done, and I knew it would be a pain to do it manually with git. The model did a fantastic job of iterating through the git commits and deciding what to put into each branch. It got everything right except for a single test that I was able to easily move to the correct branch myself.
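For reference, the manual work it spared me is roughly the following; this is just one way to do it, and the branch names and SHAs are placeholders, not the actual PR's:

```python
# Splitting a set of commits across two new branches via git cherry-pick.
# Branch names and commit SHAs are placeholders.
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

def split_commits(base: str, parts: dict[str, list[str]]) -> None:
    """Create one branch per entry in `parts`, cherry-picking its commits onto `base`."""
    for branch, commits in parts.items():
        git("checkout", "-b", branch, base)
        git("cherry-pick", *commits)

# split_commits("origin/main", {
#     "feature-part-1": ["<sha-a>", "<sha-b>"],
#     "feature-part-2": ["<sha-c>", "<sha-d>"],
# })
```

The mechanics are easy once the split is decided; the model's value-add was deciding which commits (and, in one case, which test) belonged in which part.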

How much have you used Claude 4?

  • I asked it a few questions and it responded exactly like all the other models do. Some of the questions were difficult / very specific, and it failed in the same way all the other models failed.

    • Great example of this general class of reasoning failure.

      “AI does badly on my test therefore it’s bad”.

      The correct question to ask is, of course, what is it good at? (For bonus points, think in terms of $/task rather than simply being dominant over humans.)

      4 replies →

Yes.

They just need to put out a simple changelog for these model updates; there's no need to make a big announcement every time to make it look like a whole new thing. And the version numbers are even worse.

I feel like the model making a memory file to store context is more than a gimmick, no?

The increases are not as fast, but they're still there. The models are already exceptionally strong; I'm not sure that basic questions can capture the differences very well.