Comment by majormajor
2 days ago
The question is: can an LLM actually power a true "agent," or can it just create a pretty decent simulation of one? When your only tools are a bigger context window and a better prompt, are there some nails you simply can't hit?
We have made LLMs that need far less "prompt engineering" than they did two years ago to give you something pretty decent. That makes them WAY more useful as tools.
But then you hit the wall like you mention, or like another poster on this thread saw: "Of course, it's not perfect. For example, it gave me some authentication code that just didn't work." This happens to me basically daily. And then I give it the error and ask it to modify. And then that doesn't work. And then I give it the new error. And it suggests the previous failed attempt again.
It's often still 90% of the way there, though, so the tool is pretty valuable.
But is "training on your personal quality bar" achievable? Is there enough high-quality training data in the world that a model can recognize as high-quality vs. low-quality? Are the fundamentals of the prediction machine the right ones to understand, at generation time, "this is not the right approach for this problem," given the huge variety and complexity across so many different programming languages and libraries?
TBD. But I'm a skeptic about that, because I've seen "output from a given prompt" improve a ton in two years, but I haven't seen the same level of improvement in "output after a really, really good prompt and some refinement instructions." I have to babysit it less, so I actually use it day to day way more, but it hits the wall in the same sort of unsurprising ways. (It's hard to describe precisely; it's a "know it when you see it" thing. "Ah, yes, there's a subtlety it doesn't know how to get past because a particular OAuth2 implementation has so many wrinkles, but the case was so rare in the docs and examples that it's just looping on things that aren't working.")
(The personification of these things really fucks up the discussion: for instance, when someone tells me "no, it was probably just too lazy to figure out the right way" or "it got tired of the conversation." The user interface chosen by the people making these tools really messes with people's perceptions of them. E.g., if LLM-suggested code presented as an inline autocomplete by Copilot is wrong, people tend to say "ah, Copilot's not always that great, it got it wrong," but if someone asks a chatbot instead, they're much more likely to personify the outcome.)