Comment by airstrike

2 months ago

That's...ridiculously fast.

I still feel like the best uses of models we've seen to date are for brand-new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving large, preexisting code that someone has repeatedly iterated on.

Part of that is because, by definition, models cannot know what is not in a codebase, and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.

Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.

The trick is that you've got to talk to them and share this information the same way you would with a coworker. I can give an example. These days my main workflow is as follows: if I have some big feature/refactor/whatever I'm going to work on, I'll start talking to o3 about it essentially as if it were a coworker and (somewhat painstakingly) paste in the relevant source files it needs for context. We'll have a high-level discussion about what we're trying to build and how it relates to the existing code, until I get the sense that o3 has a clear and nuanced understanding (these discussions tend to sharpen my own understanding as well). Then I'll ask o3 to generate an implementation plan that describes what needs to happen across the codebase for the change to be realized. I take that and hand it off to Codex, which might spend ten minutes executing shell commands to read source, edit files, run tests, etc. At the end I've got a PR that sometimes needs a bit more manual editing and other times is perfectly ready to merge.
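
Something like this little helper is what I mean for the "paste in relevant source files" step; it's just a sketch, and the file names and character budget are made up:

```python
from pathlib import Path

# Hypothetical helper: bundle hand-picked source files into a single
# prompt-ready block so sharing context with the model is one copy-paste.
def build_context(paths: list[str], max_chars: int = 120_000) -> str:
    chunks, total = [], 0
    for p in paths:
        text = Path(p).read_text(encoding="utf-8")
        block = f"--- {p} ---\n{text}\n"
        if total + len(block) > max_chars:  # crude character budget, not token-accurate
            break
        chunks.append(block)
        total += len(block)
    return "\n".join(chunks)

# Example (file names are illustrative only):
# print(build_context(["src/models.py", "src/routes/users.py"]))
```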

What you're saying about them needing rich context is true, but that isn't a fundamental limitation; it's just part of what it takes to work with them effectively. There's definitely a learning curve, but once you've got it down it's not only very powerful but, for me anyway, a more enjoyable headspace to occupy than lots of lower-level manual editing.

  • I would suggest trying the Continue.dev VSCode plugin for selective context injection. The plugin is Apache 2.0 licensed, and you can hook it up to any LLM API, including local models.

    It has most of the same features as GitHub Copilot, plus a few extra features I find essential. It can scrape documentation sites for individual libraries, which means you can do stuff like `@pandas @terminal @codebase Help me fix this error`.

    For greenfield projects I will usually start out in a web-based chat interface, but the second I need to go back and forth between IDE and the web I switch over to the Continue.dev plugin.

    • I’m pretty happy with Zed for development. I do plan on developing custom tooling around my style of workflow, but it’s not going to be part of an IDE.

  • Interesting approach; I'm definitely going to steal your wording for "generate an implementation plan that...".

    I do something similar but entirely within Cursor:

    1. Create a `docs/feature_name_spec.md` and use voice-to-text to brain-dump what I am trying to do.

    2. Open the AI chat panel in "Ask" mode while referencing that spec file, and ask (paste) a boilerplate snippet like: "1) Ask clarifying questions about intent, domain, restrictions, ambiguity or missing details. 2) Briefly identify any missing documents, data, or background information that would help you complete the task thoroughly."

    3. Move that list of questions into the spec doc and answer them there, attach the files it asked for, and rerun the above request (optionally switching to a different model, like gemini-2.5-pro -> o3, for a different perspective).

    4. Ask it to make an execution plan. At that point I have a fully spec'd-out feature and documented business logic, and I either use Edit mode on each step or Agent mode.

    That's for more complex features or refactors touching many files; for smaller things I essentially do a simplified version within the same chat, editing my original prompt until I'm confident I've explained myself well.

    • I spend so much time just finding and moving context pieces around these days that I bought a physical macro pad, and I've been thinking about designing some software specifically to make this quicker: rapidly finding/selecting context pieces, loading them into buffers, and relaying them into the conversation context. I think it'll have to be backed by agentic search and voice controlled, and I'm not sure how to best integrate with possible consumers… I don't know if that makes sense. I started building it and realized I need to think on the design a bit more, so for now I'm building more of the infrastructure pieces.

  • This is absolutely the best way to do it. However, it's also infeasible with the number-of-queries-based quotas most front-ends have. And of course, running models like o3 and 4-opus through the API is basically always far more expensive. Hence the desire to one-shot stuff.

  • I find myself using a similar workflow with Aider. I'll use chat mode to plan, adjust the context, enable edits, and let it go. I'll give it a broad objective and tell it to ask me questions until the requirements are clear, then have it produce a planning summary. Flipping the script like that is especially helpful when I'm unsure what I actually want.

  • I do the same thing, though sometimes I take one extra step to elaborate on the first implementation plan ‘in minute detail such that a weaker model could successfully implement it’, with deep research selected.

"...what is not in a codebase, and there is meaningful signal in that negative space."

Man, I've been writing software for money for decades now, but this fundamental truth never occurred to me, at least not consciously and with such clarity.

So, thank you!

  • I am not certain that I agree with this. If there are alternative ways of solving a problem that were not taken, then these should be documented in comments. A mantra I try to tell myself and my colleagues: if information exists in your brain and nowhere else, then write it down _somewhere_. If I tried 5 different libraries before settling on one, then I write in comments which libraries I tried but didn't work, and why. If I used a particular tool to debug a race condition, then I put a link to a wiki page on how to use it in the comments. If we have one particular colleague who is an expert in some area, then I write their name in a comment. Basically, anything that is going to save future developers' time should be written down.
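
    To make it concrete, here's a sketch of the kind of comment I mean; the library names, wiki page, and colleague are invented for illustration:

    ```python
    import random
    import time

    # Retry/backoff: we tried `tenacity` and `backoff` first, but both fought our
    # custom event-loop shutdown, so we hand-rolled this. Debugging notes are on
    # the wiki ("Retries" page); ask J. Doe on the platform team before changing
    # the semantics.
    def retry(fn, attempts: int = 5, base_delay: float = 0.5):
        for i in range(attempts):
            try:
                return fn()
            except Exception:
                if i == attempts - 1:
                    raise
                # Jittered exponential backoff, kept under the upstream API's
                # 30 s gateway timeout (also documented on the wiki).
                time.sleep(base_delay * 2**i + random.random() * 0.1)
    ```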

    • Agreed. IMO it's always a good idea to document design choices.

      The owner can write down the problem, a few solutions that were considered, why they were chosen/rejected, and a more detailed description of the final design. Stakeholders then review and provide feedback, and after some back and forth all eventually sign off the design. That not only serves to align the organization, but to document why things were done that way, so that future hires can get a sense of what is behind the code, and who was involved in case they have more questions.

      This was how we did things at some $BigCorps and it paid dividends.

    • What are you disagreeing with?

      Even if you do this (and it's good practice!), it is, empirically, not done in the vast majority of codebases.

      And even if you succeed with the utmost diligence, a vastly greater number of decisions (those you were not even aware of consciously, or took for granted) will remain undocumented but still be quite real in this "negative space" sense.

  • Then document it. Whenever you choose one algorithm/library/tech stack over another, write down your reasoning in the documentation.

    • The funny thing is that I have at least a dozen comments in my current codebase where I explain in detail why certain things are not put in place or are not served via other-solution-that-might-seem-obvious.

  • I understand what negative space is in art. Can you explain how this applies to writing software?

    • A quick example is a basic 2D game. If you're not using an engine (just a graphics library) and you have some animations, experience will tell you not to write most of the code with raw numbers. More often than not, you will write a quick vector module, just as you will use a local origin for transformations.

      But more often than not, naive code is the result of not doing the above and just writing the feature. It technically does the job, but it's verbose and difficult to maintain.

      So just like in drawing, you need to think holistically about the program. Every line of code should support an abstraction, and that will dictate which code to write and which not to write.

      That's why you so often see the concept of patterns in software. The code is not important; the patterns are, and the whole structure even more so. Code is just what gives them shape.

That's not been my experience so far. LLMs are good at mimicking existing code; they don't usually bring in new things when not asked. Sometimes I have to go out of my way to point at other bits of code in the project to copy from, because the model hasn't ingested enough of the codebase.

That said, a negative prompt like we have in stable diffusion would still be very cool.

  • I'm in the camp of 'no good for existing'. I try to get ~1000-line files refactored to use different libraries, design paradigms, etc., and it usually outputs garbage: pulling DB logic into the UI, grabbing unrelated API/function calls, or outright corrupting the output.

    I'm sure there is a way to correctly use this tool, so I'm feeling like I'm "just holding it wrong".

    • Which LLM are you using? What LLM tool are you using? What tech stack are you generating code for? And without sharing anything you can't, what prompts are you using?

    • I've refactored some files over 6000 LOC. It was necessary to do it iteratively with smaller patches ("Do not attempt to modify more than one function per iteration"), because otherwise it would just gloss over stuff. I would tell it repeatedly, "I noticed you missed something, can you find it?", and kept doing that until it couldn't find anything. Then I had to manually review and ask for more edits, along with lots of style guidelines and scope-limit instructions. In the end it worked fine and saved me hours of really boring work.

    • I'll back this up. I feel constantly gaslit by people who claim they get good output.

      I was hacking on a new project and wanted to see if LLMs could write some of it. So I picked an LLM-friendly language (python). I picked an LLM-friendly DB setup (sqlalchemy and postgres). I used typing everywhere. I pre-made the DB tables and pydantic schema. I used an LLM-friendly framework (fastapi). I wrote a few example repositories and routes.

      I then told it to implement a really simple repository and routes (users stuff) from a design doc that gave strict requirements. I got back a steaming pile of shit. It was utterly broken. It ignored my requirements. It fucked with my DB tables. It fucked with (and broke) my pydantic. It mixed db access into routes which is against the repository pattern. Etc.

      I tried several of the best models from claude, oai, xai, and google. I tried giving it different prompts. I tried pruning unnecessary context. I tried their web interfaces, and I tried cursor and windsurf and cline and aider. This was a pretty basic task I'd expect an intern could handle. The models couldn't.

      Every LLM enthusiast I've since talked to just gives me the run-around on tooling and prompting and whatever. "Well maybe if you used this eighteenth IDE/extension." "Well maybe if you used this other prompt hack." "Well maybe if you'd used a different design pattern."

      The fuck?? Can vendors not produce a coherent set of usage guidelines? If this works so well, why isn't there a set of known best practices? Why can't I ever replicate this? Why don't people publish public logs of their interactions to prove it can do this beyond a "make a bouncing ball web game" or basic to-do list app?
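
      For reference, the separation I was asking for is roughly this; a minimal sketch with made-up names and an in-memory store standing in for the DB, not the actual project:

      ```python
      from fastapi import APIRouter, Depends, FastAPI
      from pydantic import BaseModel

      class User(BaseModel):
          id: int
          name: str

      # The repository owns all data access; routes never touch the store directly.
      class UserRepository:
          def __init__(self) -> None:
              self._users = {1: User(id=1, name="alice")}  # stand-in for the real DB

          def get(self, user_id: int):
              return self._users.get(user_id)

      def get_user_repository() -> UserRepository:
          return UserRepository()

      router = APIRouter()

      @router.get("/users/{user_id}")
      async def read_user(user_id: int, repo: UserRepository = Depends(get_user_repository)):
          # The route stays a thin adapter: parse input, delegate to the repository.
          return repo.get(user_id)

      app = FastAPI()
      app.include_router(router)
      ```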

They could read the whole git history and have all the issue tracker tickets in context, and maybe even recordings from meetings. It remains to be seen, though, whether such a large context will yield usable results.

  • This. `git blame` (or tig!) and `git log -p --stat -S SEARCHSTR` are extremely powerful for understanding the what, why, and when of code.

  • I find most meetings I'm in nowadays are mostly noise; there's no clear signal of "this is the outcome", which is what I think an AI should be able to filter out.

    Of course, it'd be even better if people communicated more clearly and succinctly.

    • Maybe time to find an employer with a better culture? I rarely have meetings that I would be comfortable skipping.

A human working on an existing codebase does not have any special signal about what is _not_ in a codebase. Instead, a (good) human engineer can look at how a problem is handled and consider why it might have been done that way vs other options, then make an educated decision about whether that alternative would be an improvement. To me this seems like yet another piece of evidence that these models are not doing any "reasoning" or problem-solving.

If you make models fast enough, you can onboard that expert developer instantly and let them reason their way to a solution, especially when they are given access to RAG, too.

Over time, I expect models will add more memory and institutional-knowledge capture rather than starting from a blank slate each time.

  • I thought of that as I wrote my comment, but I think the infrastructure and glue to make that possible in a consistent, fast and scalable way is still a few years out.

    • Definitely. For now, the "frontier-level" papers (those working on repository-level code maintenance) necessarily depend on previously and statically generated code knowledge graphs or snippet-retrieval systems, which makes the "scalable" and "fast" parts complicated: any change in the code changes the graph, so the graph has to be rebuilt. And given the context limit, you have to rely on graph queries to surface the relevant parts, so at the end of the day the model reads snippets instead of the full code, which makes "consistent" an issue too, since it can't learn from the codebase as a whole. (A toy sketch of the snippet-retrieval side follows the paper list below.)

      Papers I'm referring to (just a couple of examples; there are more):

      - CodexGraph [https://arxiv.org/abs/2408.03910] - Graph

      - Agentless [https://arxiv.org/abs/2407.01489] - Snippet-Retrieval
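
      To make the snippet-retrieval side concrete, here's a toy sketch; a bag-of-words score stands in for a real embedding model or code graph, so treat it as an illustration only:

      ```python
      from collections import Counter
      from math import sqrt
      from pathlib import Path

      # Toy snippet retrieval: chunk source files and score chunks against a query.
      def chunk(text: str, lines_per_chunk: int = 40) -> list[str]:
          lines = text.splitlines()
          return ["\n".join(lines[i:i + lines_per_chunk])
                  for i in range(0, len(lines), lines_per_chunk)]

      def score(query: str, snippet: str) -> float:
          q, s = Counter(query.lower().split()), Counter(snippet.lower().split())
          dot = sum(q[t] * s[t] for t in q)
          norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in s.values()))
          return dot / norm if norm else 0.0

      # The model only ever sees the top-k snippets, never the whole codebase.
      def retrieve(query: str, repo_root: str, top_k: int = 5) -> list[str]:
          snippets = [c for p in Path(repo_root).rglob("*.py")
                      for c in chunk(p.read_text(errors="ignore"))]
          return sorted(snippets, key=lambda s: score(query, s), reverse=True)[:top_k]
      ```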

But plenty of companies have already been doing this for a decade or more:

having an old, shitty codebase and not retaining the people who built it.

I have dealt with that too, despite the original author sitting only 100 km away. The code was shit as hell: tons of copy-and-paste, with different login logic in different endpoints.

This is where it finally pays off to have ADRs and similar things.

An LLM could easily use its own knowledge to create a list of things to check inside the codebase, generate a fact sheet, and then extend it with best practices and similar knowledge.

Just because one query might not be able to do that doesn't mean there are no ways around it.

> Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space

I wonder if git history would be enough to cover this. At the very least it holds the alternatives that were tried and the code that was removed.

> they will continue to be handicapped by that lack of institutional knowledge, so to speak

Until we give them access to all Jira tickets instead of just one so they know what's missing.

  • I've been thinking about adding an agent to our Codex/Jules-like platform that goes through the git history for the main files being changed, extracts the Jira ticket IDs, and looks through them for additional context, along with analyzing the changes to the other files in those commits.
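
    Roughly along these lines, as a sketch; the Jira key pattern and the exact git invocation are assumptions, not our actual implementation:

    ```python
    import re
    import subprocess

    # List every commit message that touched a file and pull out anything that
    # looks like a Jira key (e.g. ABC-123), so those tickets can be fetched for context.
    def ticket_ids_for(path: str) -> set[str]:
        log = subprocess.run(
            ["git", "log", "--follow", "--format=%s%n%b", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return set(re.findall(r"\b[A-Z][A-Z0-9]+-\d+\b", log))

    # print(ticket_ids_for("src/auth/login.py"))  # path is illustrative
    ```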

...which is why the top LLM providers' web apps (ChatGPT, Claude.ai, Gemini) nudge you to connect Google Drive and, where appropriate, GitHub repos. They also let the user/dev provide feedback to revise the results.

All the training and interaction data will help make them formidable.