Comment by xianshou
2 days ago
One trend I've noticed, framed as a logical deduction:
1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
2. Coding agents do massively better when they have a test-driven reward signal (a minimal loop is sketched below).
3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.
4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.
5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Sure enough, I see HN projects evolving in that direction.
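To make (2) concrete, here's a minimal sketch of the loop I have in mind, not any particular tool's implementation: the test suite is the reward signal, and the model only ever sees a bounded slice of context. The `propose_patch` and `apply_patch` callables are placeholders for whatever agent harness you're using.

```python
import subprocess
from typing import Callable

def run_tests() -> subprocess.CompletedProcess:
    # Exit code 0 from the suite is the "solved"/reward signal.
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True)

def agent_loop(
    context: str,                              # the <50k-token slice the model is allowed to see
    propose_patch: Callable[[str, str], str],  # hypothetical: (context, failure output) -> diff
    apply_patch: Callable[[str], None],        # hypothetical: writes the diff to the working tree
    max_iters: int = 10,
) -> bool:
    for _ in range(max_iters):
        result = run_tests()
        if result.returncode == 0:
            return True  # tests pass: reward reached
        # Feed only the bounded context plus the failing output back to the model.
        patch = propose_patch(context, result.stdout + result.stderr)
        apply_patch(patch)
    return False
```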
> 3. If a problem can be framed in a way that a coding agent can solve...
This reminds me of the South Park underwear gnomes. You picked a tool and set an expectation, then just kind of hand wave over the hard part in the middle, as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.
Does it sometimes take 50x the effort to understand a problem and the agent well enough to get that done? Are there classes of problems where it can't be done? Are either of those concerns something you can recognize before they impact you? At commercial quality, is it an accessible skill for inexperienced people, or do you need mastery of coding, the problem domain, or the coding agent to be able to rely on it? Can teams recruit people who can reliably achieve any of this? How expensive is that talent? Etc.
>as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.
It's not, but if you can (A) make it cheap to try out different types of framings (not all of them have to work) and (B) automate everything else, then the labor intensity of programming decreases drastically.
>At commercial quality, is it an accessible skill for inexperienced people
I'd expect the opposite: it would be an extremely inaccessible skill requiring high skill and high pay. But if 2 people can deliver as much as 15 at higher quality and they're paid triple, it's still way cheaper overall.
I would still expect somebody following this development pattern to routinely discover a problem the LLM can't deal with and have to dive under the hood to fix it - digging down below multiple levels of abstraction. This would be Hard with a capital H.
We've had failed projects since long before LLMs. I think there is a tendency for people to gloss over this (3.) regardless, but working with an LLM it tends to become obvious much more quickly, without investing tens/hundreds of person-hours. I know it's not perfect, but I find a lot of the things people complain about would've been a problem either way - especially when people think they are going to go from 'hello world' to SaaS-billionaire in an hour.
I think mastery of the problem domain is still important, and until we have effectively infinite context windows (that work perfectly), you will need to understand how and when to refactor to maximize quality and relevance of data in context.
Well, according to xianshou's profile they work in finance, so it makes sense to me that they would gloss over the hard part of programming when describing how AI is going to improve it.
> as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem
It is eminently solvable! All that is necessary is to use a subset of language easier for the machine to understand, used in a very defined way; we could call this "coding language" or something similar. We could even build tools to ensure we write this correctly (to avoid confusing the machine). Perhaps we could define our own algorithms using this "language" to help them along!
> 5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Not necessarily. You can get the same benefits you described in (1)-(3) by using clearly defined modules in your codebase; they don't need to be separate microservices.
I wonder if we'll see a return of the kind of interface file present in C++, OCaml, and Ada. These files, well commented, are a natural context window to use as the reference for a module.
Even if languages don't grow them back as a first-class feature, some format that is auto-generated from the code and omits the function bodies is really what's needed here.
Python (which I mention because it is the preferred language of LLM output) has grown stub files that would work for this:
https://peps.python.org/pep-0484/#stub-files
I guess this use case would be an argument for including docstrings in your Python stub files, which I hadn't considered before.
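For example, a stub for a hypothetical `billing` module might look like this (the module and names are made up); with docstrings kept in, it reads much like an OCaml .mli or Ada spec and makes a compact context window for the module:

```python
# billing.pyi -- hypothetical PEP 484 stub for a module named `billing`
from decimal import Decimal

class Invoice:
    """An immutable record of a customer's charges for one billing period."""
    customer_id: str
    total: Decimal

def compute_total(line_items: list[tuple[str, Decimal]]) -> Decimal:
    """Sum line-item amounts, rounded to two decimal places."""
    ...

def issue_invoice(customer_id: str, total: Decimal) -> Invoice:
    """Create and persist an Invoice; raises ValueError if total is negative."""
    ...
```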
Agreed. If the microservice does not provide any value from being isolated, it is just a function call with extra steps.
I think the argument is that the extra value provided is a small enough context window for working with an LLM. Although I'd suggest making it a library if one can manage, that gives you the desired context reduction bounded by interfaces without taking on the complexities of adding an additional microservice.
I imagine throwing a test at an LLM and saying:
> hold the component under test constant (as well as the test itself), and walk the versions of the library until you can tell me where they're compatible and where they break.
If you tried to do that with a git bisect and everything in the same codebase, you'd end up varying all three (test, component, library), which is worse science than holding two constant and varying the third.
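Roughly the kind of harness I'm imagining, where only the library version varies and the component plus its test stay fixed (the package name, version list, and test path are placeholders):

```python
import subprocess

LIBRARY = "somelib"                      # placeholder dependency name
VERSIONS = ["1.0", "1.1", "1.2", "2.0"]  # placeholder version list

def passes_with(version: str) -> bool:
    # Install one library version; the component and its test stay untouched.
    subprocess.run(["pip", "install", f"{LIBRARY}=={version}"], check=True)
    result = subprocess.run(["pytest", "tests/test_component.py", "-q"])
    return result.returncode == 0

if __name__ == "__main__":
    for v in VERSIONS:
        print(v, "OK" if passes_with(v) else "BREAKS")
```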
Indeed; I think there's a strong possibility that there's certain architectural choices where LLMs can do very well, and others where they would struggle.
There are with humans, but it's inconsistent; personally I really dislike VIPER, yet I've never felt the pain others insist comes with too much in a ViewController.
Yeah, I think monorepos will be better for LLMs. Easier to refactor module boundaries as context grows or requirements change.
But practices like stronger module boundaries, module docs, acceptance tests on internal dev-facing module APIs, etc are all things that will be much more valuable for LLM consumption. (And might make things more pleasant for humans too!)
So having clear requirements, a focused purpose for software, and a clear boundary of software responsibility makes for a software development task that can be accomplished?
If only people had figured out at some point that the same thing applies when communicating to human software engineers.
If human software engineers refused to work unless those conditions were met, what a wonderful world it would be.
They do implicitly: you can only be accidentally productive without those preconditions.
> you can speed up development by 10x.
If you know what you are doing, then yes. If you are a domain expert and can articulate your thoughts clearly in a prompt, you will most likely see a boost—perhaps two to three times—but ten times is unlikely. And if you don't fully understand the problem, you may experience a negative effect.
I think it also depends on how much yak-shaving is involved in the domain, regardless of expertise. Whether that’s something simple like remembering the right bash incantation or something more complex like learning enough Terraform and providers to be able to spin up cloud infrastructure.
Some projects just have a lot of stuff to do around the edges and LLMs excel at that.
You don't need microservices for that, just factor your code into libraries that can fit into the context window. Also write functions that have clear inputs and outputs and don't need to know the full state of the software.
This has always been good practice anyway.
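A toy illustration of the difference (names are made up): the second function's whole contract is visible from its signature, while the first drags the rest of the program's state into context.

```python
state = {"discount": 0.1, "orders": [{"amount": 100.0}]}

def total_with_discount_global(order_index: int) -> float:
    # Depends on module-level state: you need the whole program in context
    # to know what this returns.
    order = state["orders"][order_index]
    return order["amount"] * (1 - state["discount"])

def total_with_discount(amount: float, discount: float) -> float:
    # Clear inputs and outputs: the contract is fully visible at the call site.
    return amount * (1 - discount)
```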
> Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
I had a very similar impression (wrote more in https://hua.substack.com/p/are-longer-context-windows-all-yo...).
One framing is that effective context window (i.e. the length the model is able to effectively reason over) determines how useful the model is. A human new-grad programmer might effectively reason over hundreds or thousands of tokens but not millions - which is why we carefully scope the work and explain where to look for relevant context. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
50K context is an interesting number because I think there's a lot to explore with software within an order of magnitude of that size. With apologies to Richard Feynman, I call it "There's plenty of room in the middle." My idea there is that the rapid expansion of computing power during the reign of Moore's law left the design space of "medium-sized" programs under-explored. These would be programs in the range of hundreds of kilobytes to low megabytes.
> microservices written from scratch and interacting via cleanly defined APIs.
Why introduce network calls? How about just factoring a monolith appropriately?
It doesn't have to be microservices. You can use modular architecture. You can use polylith. You can have boundaries in your code and mock around them.
This is a helpful breakdown of a trend, thank you
Might be a boon for test-driven development. Could turn out that AI coding is the killer app for TDD. I had a similar thought about a year ago but had forgotten; appreciate the reminder.
Hey I reached out on twitter to chat :)
> 5. Therefore all new development should be ~~microservices~~ modules written from scratch and interacting via cleanly defined APIs.
We figured this out for humans almost 20 years ago; there's some really good empirical research on it. It's the only approach to large-scale software development that works.
But it requires leadership that gives a shit about the quality of their product and value long-term outcomes over short-term rewards.
By large scale do you mean large software or large numbers of developers? Because there's some absolutely massive software out there (in terms of feature set, usefulness, and even LoC, not that that's a useful measurement) made by very small teams.
I'm not sure you've got the causal relationship the right way around here re: architecture vs. team size.
What does team size have to do with this? Small teams can (and should) absolutely build modularized software ...
You simply cannot build a [working/maintainable] large piece of software if everything is connected to everything and any one change may cause issues in conceptually unrelated pieces of code. As soon as your codebase is bigger than what you can fully memorize, you need modules, separation of concerns, etc.