Comment by DrewADesign
16 hours ago
I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect the developer job market and pay scale, but I just don’t think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don’t), a bunch of investors would make them walk the plank.
I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
> However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it.
I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs gets worse the more datapoints they need to track. But if you break the work up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to, or better than, the SOTA.
Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
And then you realize that what you’re using the smaller models for is ALSO decomposable, and part of it is just a few if statements. Then you realize that for this feature you don’t actually need or want a model at all, because the performance, reliability, and reproducibility of plain code are cheaper and better for you and your users.
So you have the model write the if statements and put itself out of a job.
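A hypothetical sketch of what that end state looks like: a routing step that started life as an LLM call, reduced to deterministic rules, with the model only handling the ambiguous leftovers. The categories and keywords here are made up for illustration.

```python
# Hypothetical example: a classification step that used to be an LLM call.
# Once the categories are known, plain if statements are cheaper, faster,
# and fully reproducible; only the leftover cases would go to a model.

def route_ticket(text: str) -> str:
    """Classify a support ticket: deterministic rules first, model fallback."""
    lowered = text.lower()
    if "refund" in lowered or "charge" in lowered:
        return "billing"
    if "crash" in lowered or "error" in lowered:
        return "bug-report"
    if "how do i" in lowered or "where is" in lowered:
        return "how-to"
    # Only the ambiguous remainder would be sent to a (small) model.
    return "needs-model"

print(route_ticket("I was charged twice, I want a refund"))  # billing
print(route_ticket("The app crashes on startup"))            # bug-report
```

The model that originally did the routing can draft exactly this kind of function from a log of its own inputs and outputs.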
Indeed, I've been experimenting with agent workflows for complicated tasks, where I essentially have a graph of agents with different roles/capabilities, including things like breaking down complex tasks into simpler ones. There seems to be a point where a complex enough task is better performed by a group of cheaper agents/models than by one agent using a big SOTA model, in terms of both quality and cost.
It is also interesting because you get people with very different use cases arguing about the effectiveness of various models while doing very different things with them.
It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and a button connected to it on the frontend, versus "Design me a YouTube". The smaller models can pretty dependably do the first and fall flat on the second.
Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.
Designing and implementing a code harness in your workflow can be as simple as running something like /skill-builder.
You prompt for what you want it to do, and it will write e.g. Python scripts as needed for the looping part, using for example claude -p for the LLM call.
You can build this in 10 minutes.
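A minimal sketch of the kind of harness script such a skill might generate, assuming `claude -p` (Claude Code's non-interactive print mode, which prints the reply to stdout and exits). The directory names and the summarization task are made up.

```python
# Sketch of a generated harness: a plain Python loop that shells out to
# `claude -p` once per input file. Hypothetical example directories.
import shutil
import subprocess
from pathlib import Path

def ask(prompt: str) -> str:
    """One non-interactive LLM call via the claude CLI."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def summarize_dir(src: str = "inputs", dst: str = "outputs") -> None:
    """The looping part: one model call per *.txt file in src."""
    out = Path(dst)
    out.mkdir(exist_ok=True)
    for path in sorted(Path(src).glob("*.txt")):
        reply = ask(f"Summarize this file in one sentence:\n\n{path.read_text()}")
        (out / path.name).write_text(reply)

# Only attempt real calls when the CLI is actually installed.
if shutil.which("claude"):
    summarize_dir()
```

The loop, the file handling, and the retry/error policy all stay in ordinary code you can read and test; the model is confined to the one `ask()` call.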
I don’t use a cloud platform, so I can’t comment on that part. I’d say just run it on your own hardware; it’s probably cheaper too.