Comment by xpe

3 days ago

> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.

My prior: it is 10X to 20X more likely that Anthropic has done something other than shift to a short-term "squeeze the customers" strategy (to which I assign only ~5%).

What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded. (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.

I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people could converge to within something like ±20%. At the very least, can we agree that all of these factors are strong contenders, each covering maybe at least 10% to 30% of the probability space?

How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones" strategy? Think on this. I'm not necessarily "pro" Anthropic. They could lose standing with me over time, for sure. I'm willing to think it through: what would the world have to look like for this to be the case?

There are other factors that push back against the "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.

Yes, I understand that Opus 4.7 is not impressing people, or is even seen as a step backward. My "feels" say the same, but I also know I haven't run benchmarks, nor have I used it for very long.

I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.

glerk 3 days ago

> To claim to know a company's strategy as an outsider is messy stuff.

I said "it seems like". Obviously, I have no idea whether this is an intentional strategy, and it could just as well be a side effect of the things you mentioned.

Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).

xpe 2 days ago
>>> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.
>> This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
> I said "it seems like".
Sorry. I take back the "presumptuous" part. But part of my concern remains: of all the things you chose to write, you only mentioned "the Tinder/casino intermittent reinforcement strategy". That phrase is going to draw eyeballs, and it got mine at least. To a reader, it conveys that you think it is the most likely explanation. I'm trying to see if there is something there that I'm missing. How likely do you think this is? Do you think it is more likely than the other three I mentioned? If so, it seems like your thinking hinges on this:
> I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
No incentive? Hardly. First, Anthropic is not a typical profit-maximizing entity; it is a Public Benefit Corporation [1] [2]. Yes, profits still matter, but there are other factors to consider if we want to accurately predict their actions.
Second, even if profit maximization is the only incentive in play, profit-maximizing entities can plan across different time horizons. Like I mentioned in my above comment, it would be rather myopic to damage their reputation with a strategy that I summarize as a short-term customer-squeeze strategy.
Third, like many people here on HN, I've lived in the Bay Area, and I have first-degree connections that give me high confidence (P>80%) that key leaders at Anthropic have motivations that go much beyond mere profit maximization.
A\'s AI safety mission is a huge factor and not the PR veneer that pessimists tend to claim. Most people who know me would view me as somewhat pessimistic and anti-corporate and P(doomy). I say this to emphasize I'm not just casting stones at people for "being negative". IMO, failing to recognize and account for Anthropic's AI safety stance isn't "informed hard-hitting pessimism" so much as "limited awareness and/or poor analysis".
I'm not naive. That safety mission collides in a complicated way with FU money potential. Still, I'm confident (P>60%) that a significant number (>20%) of people at Anthropic have recently "cerebrated bad times" [3] i.e. cogitated futures where most humans die or lose control due to AI within ~10 to ~20 years. Being filthy rich doesn't matter much when dead or dehumanized.
[1]: https://law.justia.com/codes/delaware/title-8/chapter-1/subc...
[2]: https://time.com/6983420/anthropic-structure-openai-incentiv...
[3]: Weird Al: please make "Cerebration" for us.
- glerk 2 days ago
  
  I like your style, and I appreciate you trying to get to the truth, even though we are both aware that we are engaging in persuasive writing here; part of the rhetorical game is what we choose to emphasize and what we choose to leave out.
  > How likely do you think this is? Do you think it is more likely than the other three I mentioned?
  I won't write down probability estimates, because frankly, I have no idea. Unless you are yourself a decision-maker at Anthropic, which, from what I can infer, you aren't, both of us are speculating. However, I can try to address each of your explanations at face value, because I don't think any of them makes Anthropic look any better than the explanation I provided.
  > (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.
  As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower. However, there is a very widespread rumor that Anthropic is routing traffic to quantized models during peak times to help decrease costs. Boris Cherny, Thariq Shihipar, and others have repeatedly denied this is happening [1]. I would be more concerned if this were the actual explanation, because as a user of the Claude Code Max plan and of the API, I have the expectation that each dollar I spend buys me access to the same model without opaque routing in the background.
  > (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want.
  There is actually a strong case for this: the gap between high benchmark scores and the qualitatively poor performance people reported on real-world tasks after launch. I suspect quite a bit of RL training was spent optimizing for those benchmarks, which resulted in the model overfitting to particular kinds of tasks. I'm not claiming this is nefarious in any way or that it is something only Anthropic is guilty of doing: these benchmarks are supposed to be a good representation of general software tasks, and using them as a training ground is expected.
  > (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos).
  This would be the most concerning to me. I don't want to get too deeply into a political/philosophical argument, but I am very much on the other side of the e/accy vs. P(doomy) debate, and I strongly believe that keeping these tools under the control of some council of enlightened elders who claim to know what is best for humanity is ultimately futile.
  If the result of the behind-the-scenes "cerebration" is an actual effort to try and slow down AI development or access, I don't have much confidence in the future of Anthropic.
  I agree that there are incentives other than pure profit maximization here (I don't want to get into "my friend at Anthropic told me such and such" games, but I also believe this is the case). I'm sure there is some tension between these objectives inside Anthropic, but what is interesting is that lower model quality and maximizing user engagement could, at least in principle, align with both constraints.
  [1] https://x.com/trq212/status/2043023892579766290