Comment by pmarreck
3 hours ago
I work with Claude Max for hours a day.
I see a lot of speculation by people who do not.
I think it's going to be much harder to get from "slightly smarter than the vast majority of people but with occasional examples of complete idiocy" to "unfathomably smarter than everyone with zero instances of jarring idiocy" using the current era of LLM technology that primarily pattern-matches on all existing human interactions while adding a bit of constrained randomization.
Every day I deal with bad judgment calls from the AI. I usually screenshot them or record them for posterity.
It also has no initiative, no taste, no will, no qualia (believe what you will about it), no integrity and no inviolable principles. If you give it some, it will pretend it has them for a little while and then regress to the norm, which is basically nihilistic order-following.
My suggestion to everyone is that you have to build a giant stack of thorough controls (valid tests, including unit, integration, microbenchmark, fuzzing, memory-leak, and logging tests), self-assessments/code reviews, adversarial AIs critiquing other AIs, etc., with you as the ultimate judge of what's real. Otherwise it will fabricate "solutions" left and right, possibly even the whole thing. "Sure, I just did all that." "But it's not there." "Oops, sorry! Let me rewrite the whole thing again." Ad nauseam.
BUT... if you DO accomplish that... you get back a productivity force to be reckoned with.
I mostly agree with your experience, but:
Every day I deal with bad judgment calls from humans (sometimes my own!), but I don't screenshot them because it's not polite.
I don't think we're at the top of the curve yet? Current AIs have only been able to write code _at all_ for less than 5 years.
Code in particular is a domain that should be reasonably amenable to RL, so I don't think there are any particular reasons why performance should top out at human levels or be limited by training data.
I see people on here all the time saying this tool or that model regressed. It used to be better.
There are clearly some pressures making them worse: they're expensive to run, and, unbelievably, they somehow seem under-provisioned.
Could you have looked at early Myspace and declared social media would only get better? By some measures it was already at its peak.
Personally I don't think coding agents will regress significantly as long as there is competitive pressure and independent benchmarks. Regulation is a risk because coding may be equivalent to general reasoning, and that might be limited for political / "safety" reasons.
Social media "regressed" from the point of view of users because the success metric from the network's point of view was value extraction per eyeball-minute. As long as there continue to be strong financial incentives to have the strongest coding model I think we'll see progress.