Comment by formerly_proven

1 month ago

If I had a nickel for every time I've seen a human dev disable/xfail/remove a failing test "because it's wrong" and then proceed to break production, I would have several nickels, which is not much, but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

23 comments

vizzier 1 month ago

> but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

True, but it is shocking how often Claude suggests just disabling or removing tests.

ewoodrich 1 month ago
The sneaky move that I hate most from Claude (and it does seem to mostly be a Claude-ism; I haven’t encountered it on GPT Codex or GLM) is that when dealing with an external data source (API, locally polled hardware, etc.), it adds a “helpful” fallback on failures that returns fake data in the shape of the expected output so that the rest of the code “works”.
The latest example is when I recently vibe coded a little Python MQTT client for a UPS connected to a spare Raspberry Pi to use with Home Assistant. With just a few turns back and forth I got this extremely cool bespoke tool, and it felt really fun.
So I spent a while customizing how the data was displayed on my Home Assistant dashboard and noticed every single data point was unchanging. It took a while to realize because the available data points wouldn’t be expected to change a whole lot on a fully charged UPS, but the voltage and current staying at the exact same value to a decimal place for three hours raised my suspicions.
After reading the code I discovered it had just used one of the sample command line outputs from the UPS tool I gave it to write the CLI parsing logic. When an exception occurred in the parser function it instead returned the sample data so the MQTT portion of the script could still “work”.
Tbf Claude did eventually get it over the finish line once I clarified that yes, using real data from the actual UPS was in fact an important requirement for me in a real time UPS monitoring dashboard…
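For anyone who hasn't run into this pattern, here is a minimal sketch of what it tends to look like, next to one possible alternative. Everything in it is hypothetical: the key names, the "key: value" format, and the function names are made up for illustration, since the actual UPS tool and script aren't shown in this thread.

```python
from typing import Dict, Optional

# Hypothetical sample output; the real UPS tool and its format are not given here.
SAMPLE_OUTPUT = "battery.charge: 100\ninput.voltage: 121.4\noutput.current: 0.8\n"


def parse_status(raw: str) -> Dict[str, float]:
    """Parse 'key: value' lines into floats; raise ValueError on bad input."""
    readings = {}
    for line in raw.strip().splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"unparseable line: {line!r}")
        readings[key.strip()] = float(value)  # float() raises ValueError on junk
    if not readings:
        raise ValueError("no readings found")
    return readings


def read_with_fake_fallback(raw: str) -> Dict[str, float]:
    """The anti-pattern: any parse failure is hidden behind canned sample data."""
    try:
        return parse_status(raw)
    except Exception:
        # Downstream code keeps "working", but the values never change.
        return parse_status(SAMPLE_OUTPUT)


def read_or_none(raw: str) -> Optional[Dict[str, float]]:
    """One alternative: surface the failure so the caller can skip publishing."""
    try:
        return parse_status(raw)
    except ValueError as exc:
        print(f"UPS read failed, publishing nothing: {exc}")
        return None
```

With the second version the dashboard would show a gap or a stale sensor instead of a plausible-looking constant, which is a lot easier to notice.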
- teiferer 1 month ago
  
  Always check the code.
  It's similar to early versions of autonomous driving. You wouldn't want to sit in the back seat with nobody at the wheel. That would get you killed, guaranteed.
  
- duskdozer 1 month ago
  
  Sounds to me like more evidence in favor of the idea that they're meant to play the golden retriever engineer reporting to you, the extremely intelligent manager.
icedchai 1 month ago
A coworker opened a PR full of AI slop. One of the first things I do is check if the tests pass. Of course, they didn't. I asked them to fix the tests, since there's no point in reviewing broken code.
"Fix the tests." This was interpreted literally, and assert status == 200 got changed to assert status == 500 in several locations. Some tests required more complex edits to make them "pass."
Inquiries about the tests went unanswered. Eventually the 2000-line PR of slop was closed without merging.
- saghm 1 month ago
  
  After a certain point the response to low effort vibe code has to be vibe reviews. Failing tests? Bad vibes, close without merging. Much more efficient than vibe coding too, since no AI is needed.
ciaranmca 1 month ago
100%. I'm trying a bit of an experiment like this (similar in that I mostly just care about playing around with different agents, techniques, etc.), and it has built out literally hundreds of tests. Dozens of them were almost pointless because it decided to mock the APIs. When the number of failed tests exceeded 40, it just started disabling tests.
- icedchai 1 month ago
  
  To be fair, many human developers are fond of pointless tests that mock everything to the extent that no real code is actually exercised. At least the tests are fast though.
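To make the "mocks everything" point concrete, here is a rough sketch of the difference. The function `fetch_status`, its client, and the "/health" path are hypothetical, invented only for illustration; the sketch uses nothing beyond the standard library's `unittest.mock`.

```python
from unittest.mock import Mock


def fetch_status(client) -> int:
    """Hypothetical code under test: map a client response to a status code."""
    response = client.get("/health")
    return 200 if response.get("ok") else 500


def test_mocks_everything():
    # The function under test is itself replaced by a mock, so fetch_status
    # never runs and the assertion only checks the mock's canned return value.
    fake_fetch_status = Mock(return_value=200)
    assert fake_fetch_status(Mock()) == 200


def test_mocks_only_the_boundary():
    # Only the external client is faked; the real branching logic executes.
    client = Mock()
    client.get.return_value = {"ok": False}
    assert fetch_status(client) == 500
    client.get.assert_called_once_with("/health")
```

The first test stays green even if `fetch_status` is deleted or broken, which is why it is fast and also why it tells you nothing.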
  
zephen 1 month ago

> it is shocking how often Claude suggests just disabling or removing tests.
Arguably, Claude is simply successfully channeling what the developers who wrote the bulk of its training data would do. We've already seen how bad behavior injected into LLMs in one domain causes bad behavior in other domains, so I don't find this particularly shocking.
The next frontier in LLMs has to be distinguishing good training data from bad training data. The companies have to do this, even if only in self defense against the new onslaught of AI-generated slop, and against deliberate LLM poisoning.
If the models become better at critically distinguishing good from bad inputs, particularly if they can learn to treat bad inputs as examples of what not to do, I would expect one benefit to be that their increased ability to write working code will greatly increase their willingness to do so, rather than simply disabling failing tests.

dullcrisp 1 month ago

If I had a nickel for every time I’ve seen a human being pull down their pants and defecate in the middle of the street I’d have a couple nickels. That’s not a lot but it suggests that this behavior is not LLM specific.

Tade0 1 month ago

If anything, the LLMs had to learn that from somewhere, so they're just copying human behaviour.

aspenmartin 1 month ago

I'm definitely in the camp that this browser implementation is shit, but just a reminder: agent training does involve human coding data in its early stages to bootstrap it, but its reinforcement learning phase does not -- it learns closer to the way AlphaGo did, through self-play and verifiable rewards. This is why people are very bullish on agents: there is technically no limit to how well they can learn (unlike LLMs), and we know we will reach superhuman skill. The crucial reason for this is verifiable rewards. You have them for coding; you do not have them for, e.g., creative tasks.
So agents will actually be able to build a {browser, library, etc.} that won't be an absolute slopfest, but the crucial question is when. You need better and more efficient RL training, further scaling (Amodei thinks scaling is really the only thing you technically need here, and that we have about 3-4 orders of magnitude of headroom left before we hit insurmountable limits), bigger context windows (that models actually handle well), and possibly continual learning paradigms, but solutions to these problems are quite tangible now.

teiferer 1 month ago

Where I work, that's exceptionally rare, to the point of being practically nonexistent.

mickdarling 1 month ago

Had humans not been doing this already, I would have walked into Samsung with the demo application that was working an hour before my meeting, rather than the Android app that could only show me the opening logo.

There are a lot of really bad human developers out there, too.

moregrist 1 month ago
> Entrepreneur, CEO and founder of Tomorrowish a social media DVR
So you flubbed managing a project and are now blaming your employees. Classy.
- mickdarling 1 month ago
  
  Wasn't my project to manage. That was a consulting gig. And I fired the client right after this.
- DonHopkins 1 month ago
  
  Nice blog post, gp serial entrepreneur founder bro -- what did your investors think of that?
  http://www.mickdarling.com/2019/07/26/busy-summer/
  An embedded page at landr-atlas.com says: Attention! MacOS Security Center has identified that your system is under threat. Please scan your MacOS as soon as possible to avoid more damage. Don't leave this page until you have undertaken all the suggested steps by authorised Antivirus. [OK]
  