Comment by embedding-shape
1 day ago
> it does that at first, then a few messages of corrections and pointers, it forgets that constraint.
Yup, most models suffer from this. Everyone is raving about million-token context windows, but none of the models can actually get past 20% of that and still give responses as high quality as the very first message.
My whole workflow right now is basically composing prompts outside the agent, letting it run with them, and if something is wrong, restarting the conversation from zero with a rewritten prompt. None of that "No, what I meant was ..." — instead I rewrite the prompt so the agent essentially solves it without any back and forth, precisely because of the issue you mention.
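The "restart from zero" workflow described above can be sketched as a loop. This is only an illustration: `run_agent` and `looks_correct` are hypothetical stand-ins for whatever agent CLI and review step you actually use, not a real API.

```python
def run_agent(prompt: str) -> str:
    """Placeholder: send `prompt` to a coding agent in a FRESH session."""
    return f"code generated for: {prompt}"

def looks_correct(output: str) -> bool:
    """Placeholder: human (or test-suite) review of the agent's output."""
    return "edge cases" in output

def solve(initial_prompt: str, max_attempts: int = 5) -> str:
    prompt = initial_prompt
    output = ""
    for _ in range(max_attempts):
        # Always a fresh conversation: no accumulated context to decay.
        output = run_agent(prompt)
        if looks_correct(output):
            break
        # Instead of replying "no, what I meant was...", fold the
        # correction back into the prompt and start over from zero.
        prompt = prompt + " (also handle edge cases)"
    return output
```

The point of the sketch is that corrections mutate the prompt, never the conversation, so every attempt runs against a clean context.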
Seems to happen in Codex, Claude Code, Qwen Coder and Gemini CLI as far as I've tested.
LLMs do a cool parlour trick; all they do is predict “what should the next word be?” But they do it so convincingly that in the right circumstances they seem intelligent. But that’s all it is; a trick. It’s a cool trick, and it has utility, but it’s still just a trick.
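For a concrete (if wildly oversimplified) sense of what "predict the next word" means, here is a toy bigram model: count which word follows which in a corpus, then always emit the most frequent continuation. Real LLMs are vastly more sophisticated, but the training objective has the same shape.

```python
from collections import Counter, defaultdict

# Tiny toy corpus; a real model trains on trillions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation seen after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat"/"fish" once each
```

No understanding anywhere in that code, yet it still "continues" text — which is the commenter's point about the trick scaling up convincingly.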
All these people thinking that if only we add enough billions of parameters when the LLM is learning and add enough tokens of context, then eventually it’ll actually understand the code and make sensible decisions? These same people perhaps also believe if Penn and Teller cut enough ladies in half on stage they’ll eventually be great doctors.
Yes, agreed. I find it interesting that people say they're building these huge multi-agent workflows, since the projects I've tried this on are not necessarily huge in complexity. I've tried a variety of different things re: instruction files, etc. at this point.
So far, I haven't seen any demonstration of those kinds of multi-agent workflows ending up with code that won't fall over itself within days or weeks. Most efforts so far seem to have been focused on producing as much code as possible, as fast as possible, while what I'd like to see, if anything, is the opposite of that.
Anytime I ask for a demonstration of what the actual code looks like when people start talking about their own "multi-agent orchestration platforms" (or whatever), they either haven't shared anything (yet), don't care at all what the code actually looks like, and/or the code is a horrible vibeslopped mess that is mostly nonsense.
been experimenting with the same flow as well, it is sort of the motivation behind this project - to streamline the generate code -> detect gaps -> update spec -> implement flow.
curious to hear if you are still seeing code degradation over time?
I call this the Groundhog Day loop
That's a strange name, why? It's more like an "iterate and improve" loop. "Groundhog Day" to me would imply "the same thing over and over", but you're really doing something wrong if that's your experience. You need to iterate on the initial prompt if you want something better/different.
I thought "iterate and improve" was exactly what Phil did.