Comment by bob1029

7 hours ago

Appropriate feedback is critical for good long-horizon performance. Feedback doesn't necessarily have to flow from autonomous tools back to the LLM; it can also flow from the tools to humans, who then iterate on the prompt / tools accordingly.

I've recently discovered that if a model gets stuck in a loop on a tool call across many different runs, it's almost certainly because of a gap between what the model expects the available tools to do in that context and what they actually do, not some random model failure mode.

For example, I had a tool called "GetSceneOverview" that was being called as expected and then devolved into looping. Once I counted how many times it was looping, I realized it was internally trying to pass per-item arguments in a way I couldn't see from outside the OAI API black box. I had never provided a "GetSceneObjectDetails" method (or an explanation for why it doesn't exist), so it tried the next best thing for each item returned in the overview.
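To make the gap concrete, here's a rough sketch of the tool schema shape in OpenAI Chat Completions terms. The tool names are from my setup, but the parameters and descriptions here are simplified illustrations, not my actual definitions:

    # Simplified OpenAI-style tool definitions (plain Python dicts).
    # GetSceneOverview existed; GetSceneObjectDetails is the one the model kept reaching for.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "GetSceneOverview",
                "description": "Returns a flat list of all objects in the current scene.",
                "parameters": {"type": "object", "properties": {}, "required": []},
            },
        },
        # The gap: no per-item tool, and no note in any description explaining why.
        # The model's "next best thing" was to keep hammering the overview tool once
        # per item, which from outside the API just looks like a loop.
        # {
        #     "type": "function",
        #     "function": {
        #         "name": "GetSceneObjectDetails",
        #         "description": "Returns detailed properties for a single scene object.",
        #         "parameters": {
        #             "type": "object",
        #             "properties": {"object_id": {"type": "string"}},
        #             "required": ["object_id"],
        #         },
        #     },
        # },
    ]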

I went one step further and asked: can the LLM just tell me directly where the tooling expectation gap is? Sure enough, it can. If you provide the model with a ReportToolIssue tool, you'll start to get these insights a lot more directly. Once I had cleared the non-trivial tool concerns it reported, the looping issues all but vanished. It was catching things I simply couldn't see. The best insight was that I hadn't provided parent ids for each scene object (which I had assumed weren't relevant for my test command), so it was banging its head against those tools trying to figure out the hierarchy. I didn't realize how big a problem this was until I saw it complaining about it every time I ran the experiment.
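For reference, the ReportToolIssue tool itself is nothing special. Something along these lines would do; this is a simplified Python sketch of the shape, not my exact schema or handler:

    # Sketch of a ReportToolIssue escape hatch (OpenAI Chat Completions format).
    report_tool_issue = {
        "type": "function",
        "function": {
            "name": "ReportToolIssue",
            "description": (
                "Call this when the available tools are missing something you need, "
                "return confusing data, or don't behave as their descriptions suggest. "
                "Describe the gap; a human will review it."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "tool_name": {
                        "type": "string",
                        "description": "Tool the issue relates to, if any.",
                    },
                    "issue": {
                        "type": "string",
                        "description": "What you expected vs. what you actually got.",
                    },
                },
                "required": ["issue"],
            },
        },
    }

    def handle_report_tool_issue(args: dict) -> str:
        # Log the report for the human in the loop, then let the model carry on.
        print(f"[tool issue] {args.get('tool_name', 'n/a')}: {args['issue']}")
        return "Noted. Continue with the tools you have."

The important design choice is that the handler does nothing for the run itself; the report exists purely so the human can iterate on the prompt and tools between runs.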