Comment by londons_explore

2 days ago

Very true. Last year I at least glanced at every line of AI generated code. Now if some AI makes a 10k line program for some one-off tasks, I run the program, glance only over the output, and move on.

Especially if you're having an LLM write non-interactive scripts to calculate complex things from large datasets, glancing at the output is not enough to know if the output is remotely accurate (unless the output is so trivial you could literally do it in your head).

Case in point: I recently asked an LLM to write a pile of code to compile historical baseball stats, so I could test betting success against the results of my hand-written code that evolves genetic algorithms. I marveled for a little while at the unbelievable improvement in EV/ROI that this script showed could have been achieved with certain small tweaks. I only noticed after one of my totals bets pushed that the push registered in the output as a win, and only because I was carefully staying on top of it. A single stupid >= instead of >, operating recursively, had caused completely nonsensical results that looked plausible.
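To make that failure mode concrete, here is a minimal sketch of how one comparison operator can turn pushes into wins and inflate ROI. It is not the script described above: the function names, the sample games, the -110 style odds (1.91 decimal) and the choice to count pushed stakes in the ROI denominator are all assumptions made for illustration.

```python
from enum import Enum
from typing import Callable, Iterable, Tuple


class Outcome(Enum):
    WIN = "win"
    LOSS = "loss"
    PUSH = "push"


def settle_over_buggy(total_runs: float, line: float) -> Outcome:
    # Hypothetical bug: >= treats a total that lands exactly on the line as a win.
    return Outcome.WIN if total_runs >= line else Outcome.LOSS


def settle_over(total_runs: float, line: float) -> Outcome:
    # An "over" bet wins only when the total is strictly above the line.
    if total_runs > line:
        return Outcome.WIN
    if total_runs < line:
        return Outcome.LOSS
    return Outcome.PUSH  # exact tie with the line: stake is returned


def roi(
    games: Iterable[Tuple[float, float]],
    settle: Callable[[float, float], Outcome],
    decimal_odds: float = 1.91,  # assumed price, roughly -110 American odds
    stake: float = 1.0,
) -> float:
    # Profit divided by total amount staked. Pushes add zero profit but still
    # count toward the amount staked (an assumption; conventions differ).
    profit = 0.0
    staked = 0.0
    for total_runs, line in games:
        outcome = settle(total_runs, line)
        staked += stake
        if outcome is Outcome.WIN:
            profit += stake * (decimal_odds - 1.0)
        elif outcome is Outcome.LOSS:
            profit -= stake
    return profit / staked if staked else 0.0


# Made-up (total_runs, line) pairs; two of the four land exactly on the line.
GAMES = [(9, 8.5), (8, 8), (7, 8), (8, 8)]

if __name__ == "__main__":
    print("buggy ROI:  ", roi(GAMES, settle_over_buggy))
    print("correct ROI:", roi(GAMES, settle_over))
```

Both versions print a plausible-looking number, which is the problem: the buggy ROI comes out positive and the correct one slightly negative on the same data, and nothing in the output alone tells you which to believe.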

Imagine, like, trusting a 10k loc script to give you data for something you were going to build in the physical world, and hoping an LLM hadn't made a mistake like that.

Which one-off tasks need 10k lines of code?

  • Would depend on what AI and prompt you use ultimately. Ask it to add tests (functional, E2E and unit, maybe invent a new type too), packaging, modular code and/or whatever, and you get to 10K relatively quickly with some of the more verbose LLMs out there.

    Personally it's probably the biggest struggle, trying to rein in the "spray and pray" approach LLMs typically like to take, and reducing the "patch on top of patch" syndrome too.

  • A one-off web app for scrubbing through some data that, once done, will never be run again?

This is fine for one-off tools and I do the same. But for building long-lived "professional grade" production software, this fails real quickly.

My team is using AI for most of the code, but the human review layer is crucial and unavoidable if you're interested in things like reliability, uptime, controlled feature rollouts, the integrity of your users' data, etc.

A huge factor I don't see mentioned often enough is the rapid increase of AI coding in a language unknown to the dev.