Comment by Incipient

2 months ago

I'm in the camp of 'no good for existing code'. I try to get ~1000-line files refactored to use different libraries, design paradigms, etc., and it usually outputs garbage - pulling DB logic into the UI, grabbing unrelated API/function calls, or outright corrupting the output.

I'm sure there is a way to correctly use this tool, so I'm feeling like I'm "just holding it wrong".

Which LLM are you using? What LLM tool are you using? What's the tech stack you're generating code for? Without sharing anything you can't, what prompts are you using?

  • It was more of a general comment - I'm surprised there would be significant variation between any of the frontier models.

    To answer, though: VS Code with various Python frameworks/libraries (dash, fastapi, pandas, etc.), typically passing the 4-5 relevant files in as context.

    I'm developing via Docker, so I haven't found a nice way to get agents working.

    • I would suggest using an agentic system like Cline, so that the LLM can wander through the codebase by itself, do research, and build a "mental model", then set up an implementation plan. Then you iterate on that plan and hand it off for implementation. This flow works significantly better than what you're describing.

I've refactored some files of over 6,000 LOC. It was necessary to do it iteratively, with smaller patches: "Do not attempt to modify more than one function per iteration." Otherwise it would just gloss over stuff. I told it repeatedly, "I noticed you missed something, can you find it?", and kept doing that until it couldn't find anything. Then I had to manually review and ask for more edits. I also gave it lots of style guidelines and scope-limit instructions. In the end it worked fine and saved me hours of really boring work.

I'll back this up. I feel constantly gaslit by people who claim they get good output.

I was hacking on a new project and wanted to see if LLMs could write some of it. So I picked an LLM-friendly language (Python). I picked an LLM-friendly DB setup (SQLAlchemy and Postgres). I used typing everywhere. I pre-made the DB tables and Pydantic schema. I used an LLM-friendly framework (FastAPI). I wrote a few example repositories and routes.

I then told it to implement a really simple repository and routes (user stuff) from a design doc that gave strict requirements. I got back a steaming pile of shit. It was utterly broken. It ignored my requirements. It fucked with my DB tables. It fucked with (and broke) my Pydantic models. It mixed DB access into the routes, which breaks the repository pattern. Etc.
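To be concrete about the layout I mean, here's a simplified sketch of the repository pattern with FastAPI/SQLAlchemy/Pydantic - made-up names, not my actual project code:

    # Simplified sketch of the repository-pattern setup described above.
    # Names (UserRow, UserOut, UserRepository) are made up for illustration.
    from fastapi import APIRouter, Depends
    from pydantic import BaseModel
    from sqlalchemy import String, select
    from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

    class Base(DeclarativeBase):
        pass

    class UserRow(Base):
        __tablename__ = "users"
        id: Mapped[int] = mapped_column(primary_key=True)
        email: Mapped[str] = mapped_column(String(255), unique=True)

    class UserOut(BaseModel):
        id: int
        email: str
        model_config = {"from_attributes": True}

    class UserRepository:
        """All DB access lives here; routes never touch the session directly."""

        def __init__(self, session: Session) -> None:
            self.session = session

        def get_by_email(self, email: str) -> UserRow | None:
            return self.session.scalar(select(UserRow).where(UserRow.email == email))

    router = APIRouter()

    def get_user_repo() -> UserRepository:
        # In a real app this would be wired to a session factory via Depends.
        raise NotImplementedError

    @router.get("/users/{email}", response_model=UserOut)
    def read_user(email: str, repo: UserRepository = Depends(get_user_repo)) -> UserOut:
        # The route only talks to the repository; no SQLAlchemy in the handler.
        # (Error handling elided for brevity.)
        user = repo.get_by_email(email)
        return UserOut.model_validate(user)

The whole point of the examples I pre-wrote was that boundary: routes depend on a repository, and only the repository touches the session. The LLMs kept collapsing it.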

I tried several of the best models from Anthropic, OpenAI, xAI, and Google. I tried giving them different prompts. I tried pruning unnecessary context. I tried their web interfaces, and I tried Cursor, Windsurf, Cline, and Aider. This was a pretty basic task I'd expect an intern to handle. The models couldn't.

Every LLM enthusiast I've since talked to just gives me the run-around on tooling and prompting and whatever. "Well maybe if you used this eighteenth IDE/extension." "Well maybe if you used this other prompt hack." "Well maybe if you'd used a different design pattern."

The fuck?? Can vendors not produce a coherent set of usage guidelines? If it works so well, why isn't there a set of known best practices? Why can't I ever replicate this? Why don't people publish public logs of their interactions to prove it can do this beyond a "make a bouncing ball web game" or a basic to-do list app?

  • > Why don't people publish public logs of their interactions to prove it can do this beyond a "make a bouncing ball web game" or basic to-do list app?

    It's possible I've published more of those than anyone else. I share links to Gists with transcripts of how I use the models all the time.

    You can browse a lot of my collection here: https://simonwillison.net/search/?q=Gist&sort=date

    Look for links that say things like "transcript".