Comment by marginalia_nu

2 months ago

My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

Well, when you write it manually you are doing the review and sanity checking in real time. For some tasks (not all, but definitely the difficult ones), the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in how AIs' real-world problem-solving skills evolve than in their performance on code problems.

I think programming is giving people a false impression of how intelligent the models are: programmers are meant to be smart, right, so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text-based. When it comes to problem solving I still regularly see the models confused by simple stuff, and I have to reset the context to straighten them out. They're not a general-purpose human replacement just yet.

And it’s slower to review because you didn’t do the hard part of understanding the code as it was being written.

  • You're holding it wrong.

    Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.

    i.e., enforce conventions, set specific and measurable/verifiable goals, and define skeletons of the resulting solutions if you want/can.

    To give an example: I do a lot of image-similarity work, and I wanted to test the Redis VectorSet stuff while it was still in beta, but the PHP extension for Redis (the fastest one, written in C as a proper language extension, not a runtime lib) didn't support the new commands. I cloned the repo, fired up Claude Code, and pointed it at a local copy of the Redis VectorSet documentation I had put in the directory root, telling it I wanted it to update the extension to support the new commands I would need to handle VectorSets. This was, idk, maybe a year ago, so not even Opus. It nailed it. But I chickened out of pushing that into a production environment, so I then told it to just write me a PHP runtime client that mirrors the functionality of Predis (a pure-PHP Redis client) but does so via shell commands executed by PHP (lmao, I know).

    Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.
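    The shell-out fallback described above can be sketched roughly like this (Python stands in for the PHP client in the anecdote; `ShellRedisVectorClient` and the exact VADD/VSIM argument shapes are assumptions based on the VectorSet docs and may differ by Redis version):

```python
import subprocess

class ShellRedisVectorClient:
    """Hypothetical sketch: reach VectorSet commands by shelling out to
    redis-cli instead of waiting for native extension support."""

    def __init__(self, host="127.0.0.1", port=6379, runner=subprocess.run):
        # runner is injectable so command construction can be exercised
        # without a live Redis server
        self.base = ["redis-cli", "-h", host, "-p", str(port)]
        self.runner = runner

    def _call(self, *args):
        # every command is just argv handed to redis-cli
        argv = self.base + [str(a) for a in args]
        return self.runner(argv, capture_output=True, text=True)

    def vadd(self, key, element, vector):
        # assumed beta-era syntax: VADD key VALUES <dim> <v1> ... <vn> <element>
        return self._call("VADD", key, "VALUES", len(vector), *vector, element)

    def vsim(self, key, vector, count=10):
        # assumed syntax: VSIM key VALUES <dim> <v1> ... <vn> COUNT <n>
        return self._call("VSIM", key, "VALUES", len(vector), *vector,
                          "COUNT", count)
```

    Shelling out per command is obviously slow compared to a real extension, but the injectable runner at least keeps the argv construction testable without Redis installed.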

    • They aren't holding it wrong, it's a fundamental limitation of not writing the code yourself. You can make it easier to understand later when you review it, but you still need to put in that effort.


    • You are correct, but developers are not yet ready to face it. The argument you'll always get rests on the flawed premise that it's less effort to write it yourself (while the same people work on teams where others write code for them every day of the week).

    • So, in my experience, evaluating Opus 4.6 on an existing code base has gone like this.

      You say "Do this thing".

      - It does the thing (takes 15 min). Looks incredibly fast. I couldn't code that fast. It's inhuman. So far all the fantastical claims hold up.

      But still. You ask "Did you do the thing?"

      - It says oops, I forgot to do that sub-thing. (+5m)

      - it fixes the sub-thing (+10m)

      You say is the change well integrated with the system?

      - It says not really, let me rehash this a bit. (+5m)

      - It irons out the wrinkles (+10m)

      You say does this follow best engineering practices, is it good code, something we can be proud of?

      - It says not really, here are some improvements. (+5m)

      - It implements the best practices (+15m)

      You say to look carefully at the change set and see if it can spot any potential bugs or issues.

      - It says oh, I've introduced a race condition at line 35 of file foo and a null-correctness bug at line 180 of file bar. Fixing. (+15m)

      You ask if there's test coverage for these latest fixes?

      - It says "i forgor" and adds them. (+15m)

      Now the change set has shrunk a bit and is superficially looking good. Still, you must read the code line by line, and with an experienced eye you will still find weird stuff happening in several of the functions: there are redundant operations, and resources aren't always freed up. (60m)

      You ask why it's implemented in such a roundabout way and how it intends for the resources to be freed up?

      - It says "you're absolutely right" and rewrites the functions. (+15m)

      You ask if there's test coverage for these latest fixes?

      - It says "i forgor" and adds them. (+15m)

      Now the 15 minutes of amazingly fast AI code gen has ballooned into taking most of the afternoon.

      Telling Claude to be diligent, not to write bugs, or to write high-quality code flat out does not work. And even if such prompting can reduce the odds of omissions or lapses, you still always, always, always have to check the output. It cannot find all the bugs and mistakes on its own. If there are bugs in its training data, you can assume there will be bugs in its output.

      (You can make it run through much of this Socratic checklist on its own, but this doesn't really save wall clock time, and doesn't remove the need for manual checking.)


  • It's the same as asking one of your juniors to do something, except now it follows instructions a little better. Coding has never been about line generation, and now you can POC something in a few hours instead of a few days or weeks to see if an idea is dumb.

    • LLMs can easily output overwhelming quantities of code. Junior devs couldn't really do that, not consistently.

      Scale/quantity matter.

      This industry is not mature enough for 1000x the bad code we have now. It was barely hanging on with 1x bad code.


"Several hours"? How big are your change sets?

If a human dropped a PR on me that took "several hours" to go through (10k+ lines of non-trivial changes), I'd jump in my car and drive to the office just to specifically slap them on the back of the head, ffs.

  • This was like 1K LOC? It's not the review that was slow, but the wrestling with the model to get the code to not suck.