Comment by glerk

3 days ago

I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.

These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.


Bridged7756 3 days ago

Mirrors my sentiment. Those tools seem mostly useful for a Google alternative, scaffolding tedious things, code reviewing, and acting as a fancy search.

It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.

xpe 2 days ago
> Those tools seem mostly useful for a Google alternative, scaffolding tedious things, code reviewing, and acting as a fancy search.
Just to get a sense for the rate of change, imagine if you took a survey. Compare what people said about AI tools... 3 years ago, 2 years ago, 1 year ago, 6 months ago. Then think about what is plausible that people will be saying in 3 months, 6 months, 9 months ...
Moving the goalposts has always happened, but it is happening faster than I've ever seen it. Many people seem to redefine their expectations on a monthly basis now. Worse, they seem to be unaware they are doing it.
Fancy search? Ok, I'll bite. Compare today's "fancy search" to what we had ~3 years ago according to your choice of metric. Here's one: minutes spent relative to information found. Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago. We don't need to argue phrasing when we can pick some prototypical tasks and compare them.
We're going to have different takes about where various AI technologies will be in these future timelines. It is much better to run to where the ball is likely to be, even if we have different ideas of where that is.
The human brain, at best, struggles to grasp even linear change. But linear extrapolation is not a good way to predict compounding technological change.
- manmal 2 days ago
  
  > Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago.
  And it will not yield the same outcome you would have had. Your own taste in clicking links and pre-filtering as you research is no longer applied if you outsource this. I’m guilty of this myself. But let’s not kid ourselves.
  I’ve had GPT Pro think for 40 minutes about the ideal reverse osmosis setup for my home. It came up with something that could have supported 10 houses and cost 20k, even though I told it all about my water consumers and that it should research their peak usage. It just failed to observe that you can buffer water in a tank.
  There’s a reason they let you steer GPT-Pro as it goes, now.
  
- toraway 2 days ago
  
  Your quoted example to make that point isn't particularly convincing, IMO. Cursor came out in 2023 and everything on that list would be a typical use case, plus ChatGPT for the search replacement.
  Of course, it wasn't nearly as effective back then compared to current SOTA models, but none of those are hard to imagine someone recommending Cursor for anytime in 2024 or later.
  If OP instead said something like one shotting an entire line of business app with 10k LoC I would agree with your reminder about perspective. But it feels somewhat hype-y to say that goal posts are being moved "monthly" when most of their list has been possible for years.
  
- ozgrakkurt 2 days ago
  
  Can you explain this literature review process?
  I don't believe you can do a job of the same quality with an LLM in 5 minutes.
  
- Bridged7756 2 days ago
  
  You're relying on public sentiment as a metric. Public sentiment is, more often than not, skewed, influenced by marketing, or flat out wrong. That is not a good metric to rely on.
  Did it ever occur to you that the ever changing goalposts might have more to do with the expensive marketing campaigns of the big LLM providers?
  We could talk about what's a measurable metric and what's not. Certainly, we don't have much beyond "benchmarks", and honestly I don't know how trustworthy they are, whether big LLM providers cheat somehow, or whether the performance is even stable. The core idea is that LLMs remain able to do exactly what they were able to do back at release: text prediction. They got better in some regards, sure.
  Your example is worrisome to me. It should be to you too. You didn't write a literature review; you generated a scaffold of a literature review, with the same vices as any LLM-based writing, still needing review and revision. I would hope you rewrite it so your work isn't associated with LLM generation. For better or worse, you normally still need to revise your work. Once again, because this point seems difficult to grasp: a text predictor is not a reliable source of information. We make tradeoffs, sacrificing reliability for ease of use, but any real work needs human review, which goes back to my first point. In this example it's doing nothing other than being a fancy search and scaffolding tool.
  The ball is likely to be in the same place because, once again, they're text predictors. Not sentient beings, or intelligent. Still generating text, still hallucinating, probably even more so thanks to the ever increasing amount of LLM-written content on the internet and initiatives like poison fountain doing a number on the generated content.
  It's wild to me to make such claims about the rate of change of those tools. You're claiming, I take it, that we'll see exponential gains for those tools, while completely ignoring the base set of constraints those models will never be able to get rid of. They only know how to produce text. They don't know, and never really will know, whether it's right.
  
danny_codes 3 days ago
I just don’t see how they’ll be able to make a profit. Open models have the same performance on coding tasks now. The incentives are all wrong. Why pay more for a model that’s no better and also isn’t open? It’s nonsense
- Bridged7756 2 days ago
  
  I wouldn't say the same, but it's pretty close. At this point I'm convinced that they'll keep running the marketing machine and that people, due to FOMO, will keep hopping onto whatever model Anthropic releases.
- braebo 2 days ago
  
  Which open model has the same performance as Opus 4.7?
  
- alex_sf 2 days ago
  
  Open models, in actual practice, don't match up to even models one or two generations old from Anthropic/OpenAI/Google. They've clearly been trained on the benchmarks. Entirely possible it was by mistake, but it's definitely happening.
  
3dfd 2 days ago
[dead]
- djeastm 2 days ago
  
  I think that's precisely why they're paying thousands of people in those other jobs to perform their tasks while collecting new data. Software was easiest because it's already mostly written down, but other jobs can be quantized with enough data points. Just give it time.

holoduke 2 days ago

You have to guide an AI, not let it roam freely. If you've got the skills to guide, you can make it output high quality.

glerk 2 days ago

Of course, and I feel like Codex/GPT is generally better at following instructions and implementing a step-by-step plan and at a lower cost. Opus still has an edge in writing, brainstorming, and open-ended frontend vibe-coding.
I’m definitely not coming at this from an “AI is useless” angle. I’ve been using these tools extensively over the past year and they are providing a massive productivity boost.
wallst07 2 days ago

That is 100% correct as a foundation.
However, when you hold your guidance constant (a baseline guide) and the model still behaves MUCH differently, that is where the problem lies.
It's as if your 'guidance' has to vary with how well the model is behaving. The analogy is a junior dev who is sometimes excellent and sometimes shows up drunk for work, and you have no breathalyzer.
the_gipsy 2 days ago

> skills to guide
Is that what the soul is?

3dfd 2 days ago

[dead]

xpe 3 days ago

> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.

My prior: it is 10X to 20X more likely that Anthropic has done something other than shift to a short-term "squeeze their customers" strategy (which I put at only ~5%).
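
As a quick sanity check on that arithmetic (a sketch only; the helper name `squeeze_probability` is made up for illustration, and the 10X and 20X figures are the ones stated above): if the alternatives together are N times more likely than the squeeze strategy, and those are the only two buckets, the squeeze probability is 1 / (1 + N).

```python
def squeeze_probability(odds_against: float) -> float:
    """Probability of the squeeze strategy when every other explanation,
    taken together, is `odds_against` times more likely.

    Assumes the two buckets are exhaustive: p + odds_against * p = 1.
    """
    return 1.0 / (1.0 + odds_against)


# The two ends of the stated range.
for odds in (10, 20):
    print(f"{odds}X -> {squeeze_probability(odds):.1%}")
# 10X -> 9.1%
# 20X -> 4.8%
```

So the 10X to 20X range corresponds to roughly 9% down to just under 5% for the squeeze strategy, which puts the ~5% figure at the 20X end of the range.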

What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded. (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.

I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?

How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones" strategy? Think on this. I'm not necessarily "pro" Anthropic. They could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case?

There are other factors that push back against the "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.

Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.

I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.

glerk 2 days ago
> To claim to know a company's strategy as an outsider is messy stuff.
I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not, and it could just as well be a side effect of the things you mentioned.
Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
- xpe 2 days ago
  
  >>> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.
  >> This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
  > I said "it seems like".
  Sorry. I take back the "presumptuous" part. But part of my concern remains: of all the things you chose to write, you only mentioned "the Tinder/casino intermittent reinforcement strategy". That phrase is going to draw eyeballs, and you got mine at least. As a reader, it conveys you think it is the most likely explanation. I'm trying to see if there is something there that I'm missing. How likely do you think it is? Do you think it is more likely than the other three I mentioned? If so, it seems like your thinking hinges on this:
  > I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
  No incentive? Hardly. First, Anthropic is not a typical profit-maximizing entity; it is a Public Benefit Corporation [1] [2]. Yes, profits still matter, but there are other factors to consider if we want to accurately predict their actions.
  Second, even if profit maximization is the only incentive in play, profit-maximizing entities can plan across different time horizons. Like I mentioned in my above comment, it would be rather myopic to damage their reputation with a strategy that I summarize as a short-term customer-squeeze strategy.
  Third, like many people here on HN, I've lived in the Bay Area, and I have first-degree connections that give me high confidence (P>80%) that key leaders at Anthropic have motivations that go much beyond mere profit maximization.
  Anthropic's AI safety mission is a huge factor and not the PR veneer that pessimists tend to claim. Most people who know me would view me as somewhat pessimistic and anti-corporate and P(doomy). I say this to emphasize I'm not just casting stones at people for "being negative". IMO, failing to recognize and account for Anthropic's AI safety stance isn't "informed hard-hitting pessimism" so much as "limited awareness and/or poor analysis".
  I'm not naive. That safety mission collides in a complicated way with FU money potential. Still, I'm confident (P>60%) that a significant number (>20%) of people at Anthropic have recently "cerebrated bad times" [3] i.e. cogitated futures where most humans die or lose control due to AI within ~10 to ~20 years. Being filthy rich doesn't matter much when dead or dehumanized.
  [1]: https://law.justia.com/codes/delaware/title-8/chapter-1/subc...
  [2]: https://time.com/6983420/anthropic-structure-openai-incentiv...
  [3]: Weird Al: please make "Cerebration" for us.
  