Comment by _fizz_buzz_

21 days ago

So, I experimented a little with smaller models, and the problem I ran into is that they would simply not call a tool that is available, and would just describe the tool instead. Is this something that Forge can help with?

Within limits, yes. Forge has escalating nudges that effectively tell the model "stop responding with text, you MUST call a tool". If the model is emitting something like "ok, let me call the tool: [valid JSON tool call in the middle of prose]", then we catch it with rescue parsing.

But at the end of the day, if the model keeps responding with text, there's nothing Forge can do. I've run into that failure mode for sure, even with Forge.

The nudges and rescue parsing work well enough for all the models shown in the eval here: relatively modern 8B+ models.

But some of the older generation (Mistral 7B, that sort of thing) still can't be used reliably in something like a production setting.
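
For anyone curious what the nudges and rescue parsing described above could look like, here is a minimal sketch. It is not Forge's actual code: the nudge wording, the function names `rescue_tool_call` and `next_nudge`, and the `{"tool": ..., "args": {...}}` shape are all assumptions made for illustration.

```python
import json
from typing import Optional

# Hypothetical nudge ladder; the wording is illustrative, not Forge's.
NUDGES = [
    "Please respond with a tool call.",
    "Do not answer in prose. Call one of the available tools.",
    "Stop responding with text. You MUST call a tool.",
]


def rescue_tool_call(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text` that looks like a tool call."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever prose follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        # Assumed call shape: {"tool": <name>, "args": {...}}
        if isinstance(obj, dict) and "tool" in obj and isinstance(obj.get("args"), dict):
            return obj
        start = text.find("{", start + 1)
    return None


def next_nudge(attempt: int) -> str:
    """Pick an increasingly forceful reminder, capped at the strongest one."""
    return NUDGES[min(attempt, len(NUDGES) - 1)]
```

If `rescue_tool_call` returns `None`, the harness would send `next_nudge(attempt)` and retry. If the model still answers in plain text after the last nudge, the run simply fails, which is the limit described above.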

  • Sorry if it's a stupid question, but isn't generating a valid JSON tool call in the middle of prose the way tool calling works? What is that missing?

    • Not stupid at all!

      Some of the older models did do this (around the 3.5 era, I think), and the harness would parse the results.

      The newer approach that the frontier labs have set up is structured tool calls: `tool_use` or `tool_calls`. The tool's output is then sent back as a separate `tool_result` rather than as a regular message.

      The failure mode in question is more the model mixing the two: "Sure, I'll read the file: {"tool": "read", "args": {"path": "foo"}}" - that'll break stuff. Other failure modes are the JSON not parsing when it is sent as a structured call, and in some cases the model just emitting text and forgetting the tool call.
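
To make the failure modes in this subthread concrete, here is a rough sketch that sorts a single assistant message into the cases mentioned. It assumes a simplified message dict modeled on the `tool_calls` style, where each call carries its arguments as a JSON string. The function name `classify_response`, the labels, and the crude substring check for an inline call are all hypothetical.

```python
import json


def classify_response(message: dict) -> str:
    """Label one assistant message with the case it falls into."""
    calls = message.get("tool_calls") or []
    if calls:
        for call in calls:
            try:
                # Assumed layout: arguments arrive as a JSON-encoded string.
                args = json.loads(call["function"]["arguments"])
            except (KeyError, TypeError, json.JSONDecodeError):
                return "malformed_call"  # structured call, but the JSON does not parse
            if not isinstance(args, dict):
                return "malformed_call"
        return "structured_call"  # the case the harness wants

    content = message.get("content") or ""
    # Crude heuristic, for illustration only: a tool-shaped object inside prose.
    if '{"tool"' in content:
        return "call_in_prose"  # candidate for rescue parsing
    return "text_only"  # the model described or forgot the tool
```

Only `structured_call` can be executed directly. `call_in_prose` is the mixed case from the example, and `text_only` is the one where a harness can do little beyond nudging and retrying.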