Comment by ryanjshaw

6 days ago

I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.

I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.

At present, for reasonable complex tasks, Codex/Claude will happily code you into an expensive corner.

1 comment

ryanjshaw

ben_w 6 days ago

Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.

(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").