Comment by daxfohl
4 hours ago
I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broken an "implicit" requirement so obvious to any reasonable human that nobody thought to document it. These can be subtle precisely because you're not looking for them; no human would ever think to do such a thing.
Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"
The value-extortion plan writes itself. How long before someone pitches the idea of models explicitly almost solving your problem, just to keep you spending? Would you even know?
That’s far-fetched. It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.
> far-fetched
Remember Google?
Once it was far-fetched that they would make the search worse just to show you more ads. Now, it is a reality.
With tokens, it is even more direct. The more tokens users spend, the more money for providers.
> It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise.
Unless you’re paying by the token.
The free market proposition is that competition (especially with the Chinese labs and Grok) means Anthropic is welcome to do that. They're even welcome to illegally collude with OpenAI so that ChatGPT is similarly gimped. But switching costs are pretty low. If it turns out I can one-shot an issue with Qwen or DeepSeek or Kimi thinking, Anthropic loses not just my monthly subscription, but everyone else's that I show it to. So no, I think that's some grade-A conspiracy theory nonsense you've got there.
It’s not that crazy. It could even happen by accident in pursuit of another unrelated goal. And if it did, a decent chunk of the tech industry would call it “revealed preference” because usage went up.
To be clear, I don't think that's what they're doing intentionally. Especially on a subscription basis, they'd rather I maximize my value per token, or just not use them at all. Lulling users into burning tokens unproductively is the worst possible option.
The way agents work right now, though, just sometimes feels that way; they don't have a good way of saying "You're probably going to have to figure this one out yourself."
This is a good point. For example, if you have access to a bunch of slot machines, one of them is guaranteed to hit the jackpot. Since switching from one slot machine to another is easy, it is trivial to go from machine to machine until you hit the big bucks. That is why casinos have such large selections of them (for our benefit).
As a rational consumer, how would you distinguish between some intentional "keep pulling the slot machine" failure rate and the intrinsic failure rate?
I feel like saying "the market will fix the incentives" handwaves away the lack of information on internals. After all, look at the market response to Google making their search less reliable - sure, an invested nerd might try Kagi, but Google's still the market leader by a long shot.
In a market for lemons, good luck finding a lime.
> These can be subtle because you're not looking for them
After any agent run, I always look at the git diff between the new version and the previous one. This helps catch things you might otherwise not notice.
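That review step is easy to make a habit. A minimal sketch of it, assuming the agent leaves its changes uncommitted in the working tree (the exact commands are illustrative, not a prescribed workflow):

    git diff --stat HEAD   # quick overview: which files were touched, and how much
    git diff HEAD          # full line-by-line comparison against the last commit

If the agent commits as it goes, diffing against whatever commit preceded the run gives the same picture.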
You are using it wrong, or using a weak model, if your failure rate is over 50%. My experience is nothing like this. It very consistently works for me. Maybe there is a <5% chance it takes the wrong approach, but you can quickly steer it in the right direction.
You are using it on easy questions. Some of us are not.
I think a lot of it comes down to how well the user understands the problem, because that determines the quality of instructions and feedback given to the LLM.
For instance, I know some people have had success with getting Claude to do game development. I have never bothered to learn much of anything about game development, but have been trying to get Claude to do the work for me. Unsuccessfully. It works for people who understand the problem domain, but not for those who don't. That's my theory.
Don't use it for hard questions like this, then; you wouldn't use a hammer to cut a plank, you'd try to make a saw instead.