Comment by bloppe

12 hours ago

Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.

I agree. Also good for small changes that need to be applied consistently across an entire codebase.

I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated, and also needed queries updated to exclude soft-deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).

Of course, this is not hard to do manually, but it is a bloody chore and tends toward being error-prone. But the agent made short work of it, for which I was very grateful.
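The pattern being described can be sketched roughly like this. A minimal SQLite illustration, with the table and column names invented for the example (they are not from the commenter's actual codebase):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

# Before the refactor, a hard delete removed the row outright:
# conn.execute("DELETE FROM users WHERE name = 'bob'")

# After: a soft delete just stamps the row instead.
conn.execute("UPDATE users SET deleted_at = datetime('now') WHERE name = 'bob'")

# Every ordinary query must now exclude soft-deleted rows...
live = conn.execute("SELECT name FROM users WHERE deleted_at IS NULL").fetchall()

# ...except privileged paths, e.g. an admin restoring deleted data.
conn.execute("UPDATE users SET deleted_at = NULL WHERE name = 'bob'")
```

The tedious part the agent automates is exactly the middle step: finding every `DELETE` and every `SELECT` across the codebase and applying that same mechanical change consistently.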

  • Do you not end up breaking half the value of referential integrity doing it that way? E.g., you had to update all the queries, but now you have a sharp edge: all future queries need to remember to be soft-delete aware. Not a blocker for sure, just a sharp edge.

    You know your system better than me for sure, a random commenter on a website :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete", and I felt compelled to give an unsolicited and likely wrong opinion.

    • Yeah, I did consider moving records to shadow tables, but - because of the nature of our data - it requires moving a lot of child records as well, so it's quite a lot of additional churn in WAL, and the same for restore. And this approach has its own challenges with referential integrity.

      More than that, though: lots of queries for reporting, and the like, suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is it doesn't really eliminate complexity for us: just moves it elsewhere.

      Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.

      The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.

      As I say, not to diss other approaches: in a different situation I might have chosen one of them.

      My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.

      For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.

    • I move the record to another _index_, generally.

      It depends whether you reliably control all the DB client code, of course.

  • Must be something incredibly simple that you're making out to be more complicated than it actually is; I've never seen an LLM do these things well.

    • This is what gives me the warm fuzzies about the HN community: people jumping to wild conclusions about your domain and systems based on a 4 sentence comment. /s
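The "sharp edge" raised earlier in the thread (every future query having to remember the soft-delete filter) can be blunted by centralizing the filter in a database view. A hypothetical SQLite sketch, with all names invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, deleted_at TEXT)")
conn.execute("INSERT INTO orders (total, deleted_at) VALUES (10.0, NULL), (20.0, '2024-01-01')")

# One view encodes the soft-delete filter once, so ordinary queries
# read from the view and cannot forget the WHERE clause.
conn.execute("CREATE VIEW live_orders AS SELECT * FROM orders WHERE deleted_at IS NULL")

totals = [row[0] for row in conn.execute("SELECT total FROM live_orders")]
```

In a Rails app the analogous move is a default scope on the model; either way the trade-off is the same one discussed above: the complexity doesn't disappear, it just lives in one place instead of every call site.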

Probably want to look at SWE-bench Pro or Terminal-Bench 2. They cover these longer-horizon tasks that need more than just writing a bit of code in one file. And SWE-bench Pro in particular is not yet saturated like many other common benchmarks. Plain SWE-bench and LCB are not really useful anymore because they are already being gamed hard so that developers can quote high numbers in a repo README or press release.

Build systems are tested by CompileBench (Quesma's benchmark).

Disclaimer: I'm the founder.

Generating big chunks of code is all I do, all day.

I don't write code by hand any more, neither at work, nor for side projects.

I work mostly in Rust and TypeScript at a developer tools company.

  • [flagged]

      I have never read a snide comment on this site that I've been more repulsed by.

      I think because it's so specifically sharpened to stab at the software developer, my compatriot, one of the primary populations here, rather than just being an overall shitty human insult -- and timed to do so when the person opens up in an honest dialogue about what they're doing.

      But good news: every large software house I've talked to in the past two years is touching AI. As tragic as that is for a multitude of good reasons surrounding the workforce/copyright/IP/human-laziness/loss-of-skill/etc, it means imric is going to be outside of software entirely, by their own rules, in just a few short years!

      Happy days!


    • We have the quietest on-call rotation of any company I've ever worked at.

      We have a high standard for code review, static verification, and tests.

      The fact that the code isn't hand-rolled artisanal code, and is generated by AI now, has so far turned out to have no impact on product quality or bugs reported.


    • Tbf, as long as you really know what you're doing and have the sense to avoid falling into a spaghetti-code trap, generating bigger chunks of code absolutely works and should be done. The pitfall happens when:

      (a) the dev has no idea what the agent is doing, or (b) the dev gives overly broad instructions.

      If you give it specific enough tasks (not to the point where it's writing singular functions) but a general class description, you're on a good track.

    • Why? Is writing code the only measure of quality when producing tools? What about unit and integration tests, UX research, and performance tests?


Oh yes! I now have my environments built by agents via kubectl / helm, and let them debug issues.

It's amazing! Saves hours of work!

I create the basic helm config, settings, etc., and when there is a conflict or something not working, I let an agent fix it!
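A loop like the one described might look something like this. Illustrative commands only: the chart, release, and pod names are invented, and this assumes helm and kubectl are installed and pointed at a cluster (which is also why there's no runnable test here):

```shell
# Render the chart locally and validate it against the API server
# without applying anything (server-side dry run).
helm template my-release ./my-chart --values values.yaml \
  | kubectl apply --dry-run=server -f -

# If a release is stuck, inspect its state and the pods behind it.
helm status my-release
kubectl get pods -l app.kubernetes.io/instance=my-release
kubectl describe pod <failing-pod>
kubectl logs <failing-pod> --previous
```

This is the kind of iterate-inspect-fix cycle an agent can grind through on its own: render, read the error, adjust the values, repeat.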